Project 7: Communicate Data Findings

Introduction:

Coronavirus disease (COVID-19) is a pandemic infectious disease caused by severe acute respiratory syndrome coronavirus 2 (SARS‑CoV‑2). It has affected 212 countries and territories around the world. The virus and the disease were unknown before the outbreak began in Wuhan, China, in December 2019. The time between exposure to COVID-19 and the onset of symptoms is commonly around five to six days but can range from 1 to 14 days. I have gathered COVID-19 data from several sources. The datasets contain confirmed cases, deaths, recovered cases and new active cases worldwide, in India, and for each Indian state and union territory separately.

  • Analysis of confirmed, dead and recovered patients worldwide.
  • Analysis of deaths and recovered patients in India and its states and union territories.
  • Analysis based on state testing.
  • Analysis based on time series.
In [126]:
import pandas as pd
import numpy  as np
import matplotlib.pyplot as plt
import seaborn as sns
from timeit import default_timer as timer
import requests
import json
In [127]:
import matplotlib.ticker as ticker
import statsmodels.api as sm
import statsmodels.formula.api as smf
import plotly.express as px
import datetime
from plotly.subplots import make_subplots
from bs4 import BeautifulSoup
import plotly.io as pio
# json_normalize is exposed at the top level of pandas; pandas.io.json.json_normalize is deprecated
from pandas import json_normalize
pio.renderers.default='notebook'
import plotly.graph_objects as go

Covid-19 data gathered from different sources:

1. All countries: data scraped with Beautiful Soup from https://www.worldometers.info/coronavirus/

2. Covid analysis of India: 'covid_19_india.csv'
3. Analysis based on age: 'AgeGroupDetails.csv'
4. Analysis based on ICMR testing labs: 'ICMRTestingDetails.csv'
5. Analysis based on testing done in Indian states: 'StatewiseTestingDetails.csv'

    Source : https://www.kaggle.com/sudalairajkumar/covid19-in-india

6. Analysis based on time series of confirmed cases worldwide: 'time_series_covid19_confirmed.csv'
7. Analysis based on time series of recovered cases worldwide: 'time_series_covid19_recovered.csv'
8. Analysis based on time series of deaths worldwide: 'time_series_covid19_deaths.csv'

    Source : https://data.humdata.org/dataset/5dff64bc-a671-48da-aa87-2ca40d7abf02

Covid_19 data for all countries, scraped with Beautiful Soup from worldometers.info:

* Dataset columns:

    - Total_cases
    - Total_deaths
    - Total_Recovered
    - New_cases
    - New Deaths
    - Active_Cases
    - Serious_Case
    - TotCases/1Mpop
    - Deaths/1Mpop
    - Total Tests
    - Test/1Mpop
In [128]:
from io import StringIO

url = "https://www.worldometers.info/coronavirus/"

try:
    req_data = requests.get(url)
    # Raise an HTTPError for 4xx/5xx responses instead of parsing an error page
    req_data.raise_for_status()
except requests.exceptions.RequestException as e:
    print("Error: Invalid Response Error.", e)
    raise

soup = BeautifulSoup(req_data.text, 'html.parser')

# Locate today's per-country table and collect its header and cell text
table = soup.find('table', attrs={'id': 'main_table_countries_today'})
header = [col_name.text.rstrip('\n').strip() for col_name in table.select('thead th')]
data = [[td.text for td in tr.find_all('td')] for tr in table.find_all('tr')]

# Parse the same table into a DataFrame once, outside any loop
full_data = pd.read_html(StringIO(str(table)))[0]
In [129]:
full_data.head(10)
Out[129]:
Country,Other TotalCases NewCases TotalDeaths NewDeaths TotalRecovered ActiveCases Serious,Critical Tot Cases/1M pop Deaths/1M pop TotalTests Tests/ 1M pop
0 World 4080142 +70,851 279280.0 +3,304 1425122.0 2375740 47667.0 523.0 35.8 NaN NaN
1 USA 1341281 +19,496 79823.0 +1,208 232360.0 1029098 16796.0 4052.0 241.0 8571364.0 25895.0
2 Spain 262783 +2,666 26478.0 +179 173157.0 63148 1741.0 5620.0 566.0 2467761.0 52781.0
3 Italy 218268 +1,083 30395.0 +194 103031.0 84842 1034.0 3610.0 503.0 2514234.0 41584.0
4 UK 215260 +3,896 31587.0 +346 NaN 183329 1559.0 3171.0 465.0 1728443.0 25461.0
5 Russia 198676 +10,817 1827.0 +104 31916.0 164933 2300.0 1361.0 13.0 5221964.0 35783.0
6 France 176658 +579 26310.0 +80 56038.0 94310 2812.0 2706.0 403.0 1384633.0 21213.0
7 Germany 171264 +676 7543.0 +33 143300.0 20421 1650.0 2044.0 90.0 2755770.0 32891.0
8 Brazil 148670 +2,778 10100.0 +108 59297.0 79273 8318.0 699.0 48.0 339552.0 1597.0
9 Turkey 137115 +1,546 3739.0 +50 89480.0 43896 1168.0 1626.0 44.0 1334411.0 15822.0

Covid_19 India CSV file :

  • Confirmed cases, deaths and cured cases. Date range: 30/01/20 to 06/05/20
In [130]:
covid_india = pd.read_csv('covid_19_india.csv')
In [131]:
covid_india.head(5)
Out[131]:
Sno Date Time State/UnionTerritory ConfirmedIndianNational ConfirmedForeignNational Cured Deaths Confirmed
0 1 30/01/20 6:00 PM Kerala 1 0 0 0 1
1 2 31/01/20 6:00 PM Kerala 1 0 0 0 1
2 3 01/02/20 6:00 PM Kerala 2 0 0 0 2
3 4 02/02/20 6:00 PM Kerala 3 0 0 0 3
4 5 03/02/20 6:00 PM Kerala 3 0 0 0 3
  1. 30/01/2020: India reported its first case, in Kerala.

  2. 02/03/2020: India reported Covid_19 cases in Telangana (spelled 'Telengana' in the dataset) and Delhi; after that, cases were reported in other states and union territories.

  3. 13/03/2020: Karnataka reported the death of one patient due to Covid_19.
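
The first-report dates in this list can be derived from the data rather than read off by hand. The following is a minimal sketch, assuming the Date column uses the day/month/two-digit-year format seen in the preview. The small frame is a hand-made stand-in for covid_india built from the dates quoted in the list, not the real file.

```python
import pandas as pd

# Hand-made stand-in for covid_india (illustrative rows only)
sample = pd.DataFrame({
    "Date": ["30/01/20", "31/01/20", "02/03/20", "02/03/20"],
    "State/UnionTerritory": ["Kerala", "Kerala", "Telengana", "Delhi"],
    "Confirmed": [1, 1, 1, 1],
})

# Parse day-first dates explicitly so 02/03/20 is read as 2 March, not 3 February
sample["Date"] = pd.to_datetime(sample["Date"], format="%d/%m/%y")

# Earliest date with at least one confirmed case, per state
first_case = (
    sample[sample["Confirmed"] > 0]
    .groupby("State/UnionTerritory")["Date"]
    .min()
    .sort_values()
)
print(first_case)
```

Passing an explicit format avoids the month/day ambiguity that automatic date inference can introduce for dates such as 02/03/20.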

Covid_19 cases in India by age group

In [132]:
covid_india_age = pd.read_csv('AgeGroupDetails.csv')
covid_india_age
Out[132]:
Sno AgeGroup TotalCases Percentage
0 1 0-9 22 3.18%
1 2 10-19 27 3.90%
2 3 20-29 172 24.86%
3 4 30-39 146 21.10%
4 5 40-49 112 16.18%
5 6 50-59 77 11.13%
6 7 60-69 89 12.86%
7 8 70-79 28 4.05%
8 9 >=80 10 1.45%
9 10 Missing 9 1.30%
In [133]:
covid_india_testing = pd.read_csv('ICMRTestingDetails.csv')
covid_india_testing.tail(5)
Out[133]:
SNo DateTime TotalSamplesTested TotalIndividualsTested TotalPositiveCases
37 38 23/04/20 9:00 541789.0 525667.0 23502.0
38 39 24/04/20 9:00 579957.0 NaN NaN
39 40 25/04/20 9:00 625309.0 NaN NaN
40 41 26/04/20 9:00 665819.0 NaN NaN
41 42 27/04/20 9:00 716733.0 NaN NaN

Covid_19 India testing CSV file :

  • The CSV contains state-wise testing data: total samples, negative and positive patients.
In [134]:
covid_india_state_testing = pd.read_csv('StatewiseTestingDetails.csv')
covid_india_state_testing
Out[134]:
Date State TotalSamples Negative Positive
0 2020-04-17 Andaman and Nicobar Islands 1403.0 1210.0 12.0
1 2020-04-24 Andaman and Nicobar Islands 2679.0 NaN 27.0
2 2020-04-27 Andaman and Nicobar Islands 2848.0 NaN 33.0
3 2020-05-01 Andaman and Nicobar Islands 3754.0 NaN 33.0
4 2020-04-02 Andhra Pradesh 1800.0 1175.0 132.0
... ... ... ... ... ...
754 2020-04-30 West Bengal 16525.0 NaN 758.0
755 2020-05-01 West Bengal 18566.0 NaN NaN
756 2020-05-02 West Bengal 20976.0 NaN 795.0
757 2020-05-03 West Bengal 22915.0 NaN 922.0
758 2020-05-04 West Bengal 25116.0 NaN 1259.0

759 rows × 5 columns

Time Series Covid_19 cases CSV files:

- The time series data contain date and country information.

In [135]:
covid_time_series_C= pd.read_csv('time_series_covid19_confirmed.csv')
covid_time_series_C
Out[135]:
Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 ... 4/27/20 4/28/20 4/29/20 4/30/20 5/1/20 5/2/20 5/3/20 5/4/20 5/5/20 5/6/20
0 NaN Afghanistan 33.000000 65.000000 0 0 0 0 0 0 ... 1703 1828 1939 2171 2335 2469 2704 2894 3224 3392
1 NaN Albania 41.153300 20.168300 0 0 0 0 0 0 ... 736 750 766 773 782 789 795 803 820 832
2 NaN Algeria 28.033900 1.659600 0 0 0 0 0 0 ... 3517 3649 3848 4006 4154 4295 4474 4648 4838 4997
3 NaN Andorra 42.506300 1.521800 0 0 0 0 0 0 ... 743 743 743 745 745 747 748 750 751 751
4 NaN Angola -11.202700 17.873900 0 0 0 0 0 0 ... 27 27 27 27 30 35 35 35 36 36
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
261 NaN Western Sahara 24.215500 -12.885800 0 0 0 0 0 0 ... 6 6 6 6 6 6 6 6 6 6
262 NaN Sao Tome and Principe 0.186360 6.613081 0 0 0 0 0 0 ... 4 8 8 14 16 16 16 23 174 174
263 NaN Yemen 15.552727 48.516388 0 0 0 0 0 0 ... 1 1 6 6 7 10 10 12 22 25
264 NaN Comoros -11.645500 43.333300 0 0 0 0 0 0 ... 0 0 0 1 1 3 3 3 3 8
265 NaN Tajikistan 38.861034 71.276093 0 0 0 0 0 0 ... 0 0 0 15 15 76 128 230 293 379

266 rows × 110 columns

In [136]:
covid_time_series_covid_19_R = pd.read_csv('time_series_covid19_recovered.csv')
covid_time_series_covid_19_R
Out[136]:
Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 ... 4/27/20 4/28/20 4/29/20 4/30/20 5/1/20 5/2/20 5/3/20 5/4/20 5/5/20 5/6/20
0 NaN Afghanistan 33.000000 65.000000 0 0 0 0 0 0 ... 220 228 252 260 310 331 345 397 421 458
1 NaN Albania 41.153300 20.168300 0 0 0 0 0 0 ... 422 431 455 470 488 519 531 543 570 595
2 NaN Algeria 28.033900 1.659600 0 0 0 0 0 0 ... 1558 1651 1702 1779 1821 1872 1936 1998 2067 2197
3 NaN Andorra 42.506300 1.521800 0 0 0 0 0 0 ... 385 398 423 468 468 472 493 499 514 521
4 NaN Angola -11.202700 17.873900 0 0 0 0 0 0 ... 6 6 7 7 11 11 11 11 11 11
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
247 NaN Western Sahara 24.215500 -12.885800 0 0 0 0 0 0 ... 5 5 5 5 5 5 5 5 5 5
248 NaN Sao Tome and Principe 0.186360 6.613081 0 0 0 0 0 0 ... 0 4 4 4 4 4 4 4 4 4
249 NaN Yemen 15.552727 48.516388 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1 1 1
250 NaN Comoros -11.645500 43.333300 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
251 NaN Tajikistan 38.861034 71.276093 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

252 rows × 110 columns

In [137]:
covid_time_series_D = pd.read_csv('time_series_covid19_deaths.csv')
covid_time_series_D
Out[137]:
Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 ... 4/27/20 4/28/20 4/29/20 4/30/20 5/1/20 5/2/20 5/3/20 5/4/20 5/5/20 5/6/20
0 NaN Afghanistan 33.000000 65.000000 0 0 0 0 0 0 ... 57 58 60 64 68 72 85 90 95 104
1 NaN Albania 41.153300 20.168300 0 0 0 0 0 0 ... 28 30 30 31 31 31 31 31 31 31
2 NaN Algeria 28.033900 1.659600 0 0 0 0 0 0 ... 432 437 444 450 453 459 463 465 470 476
3 NaN Andorra 42.506300 1.521800 0 0 0 0 0 0 ... 40 41 42 42 43 44 45 45 46 46
4 NaN Angola -11.202700 17.873900 0 0 0 0 0 0 ... 2 2 2 2 2 2 2 2 2 2
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
261 NaN Western Sahara 24.215500 -12.885800 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
262 NaN Sao Tome and Principe 0.186360 6.613081 0 0 0 0 0 0 ... 0 0 0 0 1 1 1 3 3 3
263 NaN Yemen 15.552727 48.516388 0 0 0 0 0 0 ... 0 0 0 2 2 2 2 2 4 5
264 NaN Comoros -11.645500 43.333300 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
265 NaN Tajikistan 38.861034 71.276093 0 0 0 0 0 0 ... 0 0 0 0 0 2 2 3 5 8

266 rows × 110 columns

Assessing Data:

** Assessing full_data (Covid_19 data for all countries)

In [138]:
full_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 216 entries, 0 to 215
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Country,Other     216 non-null    object 
 1   TotalCases        216 non-null    int64  
 2   NewCases          122 non-null    object 
 3   TotalDeaths       179 non-null    float64
 4   NewDeaths         83 non-null     object 
 5   TotalRecovered    209 non-null    float64
 6   ActiveCases       216 non-null    int64  
 7   Serious,Critical  135 non-null    float64
 8   Tot Cases/1M pop  214 non-null    float64
 9   Deaths/1M pop     177 non-null    float64
 10  TotalTests        183 non-null    float64
 11  Tests/ 1M pop     183 non-null    float64
dtypes: float64(7), int64(2), object(3)
memory usage: 20.4+ KB

At the time of writing this report:

  1. The table has 216 rows: 214 countries and territories plus the World and Total: summary rows. 177 countries reported deaths.

  2. TotalRecovered cases are reported for 207 countries.

  3. NewDeaths is non-null in 83 rows, summary rows included.

  4. TotalTests is non-null in 183 rows.

** Some data is missing. A missing value may mean a true zero, or that confirmed cases, deaths or recovered cases were not reported. We cannot analyse such cases.
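
The per-country counts can be checked by excluding the summary rows before counting non-null values. This is a minimal sketch on a made-up frame. The row labels 'World' and 'Total:' follow the scraped table, while the countries A, B, C and their figures are purely illustrative.

```python
import numpy as np
import pandas as pd

# Made-up sample with the same layout as the scraped table
sample = pd.DataFrame({
    "Country,Other": ["World", "A", "B", "C", "Total:"],
    "TotalDeaths": [12.0, 10.0, np.nan, 2.0, 12.0],
})

# Summary rows would otherwise inflate the count of reporting countries
summary_rows = ["World", "Total:"]
countries_only = sample[~sample["Country,Other"].isin(summary_rows)]

# Number of countries with a reported (non-null) death count
countries_reporting_deaths = int(countries_only["TotalDeaths"].notna().sum())
print(countries_reporting_deaths)
```

Filtering by label rather than by row position keeps the count correct even if the summary rows move.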

Data Issue 1:

*** Rename columns in the full_data dataframe.

In [139]:
# Normalise every header: trim, lower-case, spaces to underscores
full_data.rename(columns=lambda name: name.strip().lower().replace(" ", "_"), inplace=True)
# Map the normalised headers to readable snake_case names
full_data.rename(columns={
    "country,other": "country",
    "totalcases": "total_cases",
    "newcases": "new_cases",
    "totaldeaths": "total_deaths",
    "newdeaths": "new_deaths",
    "totalrecovered": "total_recovered",
    "activecases": "active_cases",
    "serious,critical": "serious",
    "totaltests": "total_tests",
}, inplace=True)
In [140]:
# No effect in the run shown here: later outputs still list 'tot cases/1m_pop'
full_data.rename(columns={"tot cases/1m_pop": "totcases/1m_pop"}, inplace=True)
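
The last rename leaves 'tot cases/1m_pop' unchanged in the later outputs, which suggests the scraped header contains a character that is not a regular space, presumably a non-breaking space. That is an assumption, not something the notebook confirms. Under that assumption, a sketch of a header normaliser that handles it could look as follows. The helper name normalize_header is hypothetical, and the resulting names differ from the ones the notebook uses downstream.

```python
import pandas as pd

# Empty frame with headers shaped like the scraped table (NBSP is assumed)
raw = pd.DataFrame(columns=[
    "Country,Other", "TotalCases", "Tot\xa0Cases/1M pop", "Tests/ 1M pop",
])

def normalize_header(name):
    """Replace non-breaking spaces, trim, lower-case, then snake_case."""
    return name.replace("\xa0", " ").strip().lower().replace(" ", "_")

renamed = raw.rename(columns=normalize_header).rename(
    columns={"country,other": "country", "totalcases": "total_cases"}
)
print(list(renamed.columns))
```

Printing `repr()` of the raw column names is a quick way to see whether such hidden characters are actually present.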

Data Issue 2:

Fill NaN values in numeric fields with zero:
In [141]:
full_data.fillna(0, inplace=True)
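
Filling with zero makes an unreported value indistinguishable from a true zero. One way to keep that information, sketched here on a made-up frame, is to record a boolean flag before filling. The column name tests_reported is hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Made-up sample: country B did not report tests, country C reported zero
sample = pd.DataFrame({
    "country": ["A", "B", "C"],
    "total_tests": [1500.0, np.nan, 0.0],
})

# Remember which values were actually reported before overwriting NaN
sample["tests_reported"] = sample["total_tests"].notna()
sample["total_tests"] = sample["total_tests"].fillna(0).astype("int64")
print(sample)
```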

Data Issue 3:

Remove commas and plus signs from the values in the new_cases and new_deaths columns.
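
A minimal sketch of this cleaning step on a made-up Series. The values mimic the scraped format, such as '+70,851'. Using pd.to_numeric here is a choice made for this illustration, not the notebook's own approach.

```python
import pandas as pd

# Made-up values in the scraped format; None stands for a missing entry
raw = pd.Series(["+70,851", "+3,304", None, "1"])

# Remove every '+' and ',' with one character-class pattern
stripped = raw.str.replace(r"[+,]", "", regex=True)

# Convert to numbers, treat missing entries as zero, then cast to integers
cleaned = pd.to_numeric(stripped).fillna(0).astype("int64")
print(cleaned.tolist())
```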

Data Issue 4:

Float and object columns are converted to the int64 datatype.
In [142]:
full_data['total_deaths'] = full_data['total_deaths'].astype('int64')
full_data['total_recovered'] = full_data['total_recovered'].astype('int64')
full_data['serious'] = full_data['serious'].astype('int64')
full_data['deaths/1m_pop'] = full_data['deaths/1m_pop'].astype('int64')
full_data['total_tests'] = full_data['total_tests'].astype('int64')
full_data['tests/_1m_pop'] = full_data['tests/_1m_pop'].astype('int64')
In [143]:
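# Strip thousands separators; non-string entries (the zeros filled in earlier) become NaN here
full_data['new_cases'] = full_data['new_cases'].str.replace(',', '', regex=False)
full_data['new_deaths'] = full_data['new_deaths'].str.replace(',', '', regex=False)
# Assign the result instead of calling fillna(inplace=True) on a column selection
full_data['new_cases'] = full_data['new_cases'].fillna(0)
full_data['new_deaths'] = full_data['new_deaths'].fillna(0)
# The integer cast also accepts the leading '+' in values such as '+19496'
full_data['new_cases'] = full_data['new_cases'].astype('int64')
full_data['new_deaths'] = full_data['new_deaths'].astype('int64')
full_data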
Out[143]:
country total_cases new_cases total_deaths new_deaths total_recovered active_cases serious tot cases/1m_pop deaths/1m_pop total_tests tests/_1m_pop
0 World 4080142 70851 279280 3304 1425122 2375740 47667 523.0 35 0 0
1 USA 1341281 19496 79823 1208 232360 1029098 16796 4052.0 241 8571364 25895
2 Spain 262783 2666 26478 179 173157 63148 1741 5620.0 566 2467761 52781
3 Italy 218268 1083 30395 194 103031 84842 1034 3610.0 503 2514234 41584
4 UK 215260 3896 31587 346 0 183329 1559 3171.0 465 1728443 25461
... ... ... ... ... ... ... ... ... ... ... ... ...
211 Western Sahara 6 0 0 0 5 1 0 10.0 0 0 0
212 Anguilla 3 0 0 0 3 0 0 200.0 0 0 0
213 Saint Pierre Miquelon 1 0 0 0 0 1 0 173.0 0 0 0
214 China 82887 1 4633 0 78046 208 15 58.0 3 0 0
215 Total: 4080142 70851 279280 3304 1425122 2375740 47667 523.4 35 0 0

216 rows × 12 columns

In [144]:
# Drop the last row ('Total:') and the first row ('World'): both are summary rows, not countries
full_data.drop(full_data.tail(1).index,inplace=True)
full_data.drop(full_data.head(1).index,inplace=True)
In [145]:
full_data.describe()
Out[145]:
total_cases new_cases total_deaths new_deaths total_recovered active_cases serious tot cases/1m_pop deaths/1m_pop total_tests tests/_1m_pop
count 2.140000e+02 214.000000 214.000000 214.000000 214.000000 2.140000e+02 214.000000 214.000000 214.000000 2.140000e+02 214.000000
mean 1.906608e+04 331.079439 1305.046729 15.439252 6656.672897 1.110159e+04 222.742991 982.191121 45.032710 2.148384e+05 15193.149533
std 9.865362e+04 1594.622212 6759.503801 89.700930 25562.683331 7.297667e+04 1325.467112 2107.016875 132.545621 7.737569e+05 26422.465363
min 1.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000e+00 0.000000
25% 9.650000e+01 0.000000 2.000000 0.000000 30.250000 2.600000e+01 0.000000 43.500000 0.000000 8.655000e+02 488.250000
50% 7.300000e+02 4.000000 14.000000 0.000000 257.500000 2.775000e+02 2.000000 200.000000 3.000000 1.526900e+04 3655.000000
75% 5.597750e+03 85.750000 116.500000 3.000000 1845.000000 2.275500e+03 22.750000 1070.000000 24.750000 1.321298e+05 18152.750000
max 1.341281e+06 19496.000000 79823.000000 1208.000000 232360.000000 1.029098e+06 16796.000000 18773.000000 1208.000000 8.571364e+06 171971.000000
In [146]:
full_data.sum()
Out[146]:
country             USASpainItalyUKRussiaFranceGermanyBrazilTurkey...
total_cases                                                   4080142
new_cases                                                       70851
total_deaths                                                   279280
new_deaths                                                       3304
total_recovered                                               1424528
active_cases                                                  2375740
serious                                                         47667
tot cases/1m_pop                                               210189
deaths/1m_pop                                                    9637
total_tests                                                  45975420
tests/_1m_pop                                                 3251334
dtype: object
In [147]:
full_data.total_cases.max()
Out[147]:
1341281
In [148]:
full_data.total_deaths.max()
Out[148]:
79823
In [149]:
full_data.total_recovered.max()
Out[149]:
232360
In [150]:
full_data.duplicated().sum()
Out[150]:
0
In [151]:
full_data.isna().sum()
Out[151]:
country             0
total_cases         0
new_cases           0
total_deaths        0
new_deaths          0
total_recovered     0
active_cases        0
serious             0
tot cases/1m_pop    0
deaths/1m_pop       0
total_tests         0
tests/_1m_pop       0
dtype: int64
In [152]:
full_data['country'].value_counts()
Out[152]:
Myanmar                1
Iraq                   1
Timor-Leste            1
Algeria                1
Uruguay                1
                      ..
Trinidad and Tobago    1
Tajikistan             1
Mayotte                1
Namibia                1
Chad                   1
Name: country, Length: 214, dtype: int64
In [153]:
full_data.country.nunique()
Out[153]:
214

All Countries Latitude and Longitude Data:

Some latitude and longitude data is missing in countries_data.

In [154]:
countries = pd.read_csv('countries_data.csv', encoding= 'unicode_escape')
countries
Out[154]:
country latitude longitude name
0 AD 42.546245 1.601554 Andorra
1 AE 23.424076 53.847818 United Arab Emirates
2 AF 33.939110 67.709953 Afghanistan
3 AG 17.060816 -61.796428 Antigua and Barbuda
4 AI 18.220554 -63.068615 Anguilla
... ... ... ... ...
270 MS Zaandam 52.442039 4.829199 MS Zaandam
271 Caribbean Netherlands 12.178400 68.2385 Caribbean Netherlands
272 St. Barth 17.900000 62.8333 St. Barth
273 Saint Pierre Miquelon 46.885200 56.3159 Saint Pierre Miquelon
274 CAR 6.611100 20.9394 CAR

275 rows × 4 columns

In [155]:
def getLat(country):
    row = countries.loc[countries['name'] == country ]
    return row.latitude.values
In [156]:
def getLong(country):
    row = countries.loc[countries['name'] == country]   
    return row.longitude.values
In [157]:
full_data['lat'] =full_data.apply(lambda row: getLat(row['country']), axis=1)
full_data['lat'] = full_data['lat'].str.get(0)
full_data['long'] =full_data.apply(lambda row: getLong(row['country']), axis=1)
full_data['long']= full_data['long'].str.get(0)
full_data['long'] = full_data['long'].astype('float') 
full_data.head()
Out[157]:
country total_cases new_cases total_deaths new_deaths total_recovered active_cases serious tot cases/1m_pop deaths/1m_pop total_tests tests/_1m_pop lat long
1 USA 1341281 19496 79823 1208 232360 1029098 16796 4052.0 241 8571364 25895 37.090240 -95.712891
2 Spain 262783 2666 26478 179 173157 63148 1741 5620.0 566 2467761 52781 40.463667 -3.749220
3 Italy 218268 1083 30395 194 103031 84842 1034 3610.0 503 2514234 41584 41.871940 12.567380
4 UK 215260 3896 31587 346 0 183329 1559 3171.0 465 1728443 25461 55.378100 3.436000
5 Russia 198676 10817 1827 104 31916 164933 2300 1361.0 13 5221964 35783 61.524010 105.318756

Country names in the latitude and longitude data were fixed manually in the CSV file. The latitude and longitude data and the full_data dataframe from Worldometer were then combined into one table and saved to a new CSV.
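
The per-row lookups with getLat and getLong can also be expressed as a single left join. This is a minimal sketch, assuming the name column of countries_data matches the country names in full_data. The frames are hand-made samples, and 'Atlantis' is a hypothetical unmatched name used only to show how missing coordinates surface.

```python
import pandas as pd

# Hand-made stand-ins for full_data and countries_data
cases = pd.DataFrame({
    "country": ["Spain", "Italy", "Atlantis"],
    "total_cases": [262783, 218268, 1],
})
coords = pd.DataFrame({
    "name": ["Spain", "Italy"],
    "latitude": [40.463667, 41.871940],
    "longitude": [-3.749220, 12.567380],
})

# Left join keeps every country; unmatched names get NaN coordinates
merged = (
    cases.merge(coords, how="left", left_on="country", right_on="name")
    .drop(columns="name")
    .rename(columns={"latitude": "lat", "longitude": "long"})
)

# Countries whose names still need fixing in the coordinates file
missing = merged.loc[merged["lat"].isna(), "country"].tolist()
print(missing)
```

Listing the unmatched names this way shows exactly which entries in the CSV need a manual fix.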

In [158]:
full_data.to_csv('covid.csv')
In [159]:
#full_data['lat'] = full_data.apply(lambda row: getLatL(row['country']), axis=1)
#full_data['long'] = full_data.apply(lambda row: getLong(row['country']), axis=1)
#full_data.info()

** Assessing the Covid_19 India data.

In [160]:
covid_india.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1671 entries, 0 to 1670
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Sno                       1671 non-null   int64 
 1   Date                      1671 non-null   object
 2   Time                      1671 non-null   object
 3   State/UnionTerritory      1671 non-null   object
 4   ConfirmedIndianNational   1671 non-null   object
 5   ConfirmedForeignNational  1671 non-null   object
 6   Cured                     1671 non-null   int64 
 7   Deaths                    1671 non-null   int64 
 8   Confirmed                 1671 non-null   int64 
dtypes: int64(4), object(5)
memory usage: 117.6+ KB
In [161]:
covid_india.describe()
Out[161]:
Sno Cured Deaths Confirmed
count 1671.000000 1671.000000 1671.000000 1671.000000
mean 836.000000 86.929982 13.054458 408.981448
std 482.520466 251.026352 49.430082 1188.132537
min 1.000000 0.000000 0.000000 0.000000
25% 418.500000 0.000000 0.000000 5.000000
50% 836.000000 5.000000 0.000000 32.000000
75% 1253.500000 34.000000 4.000000 258.500000
max 1671.000000 2819.000000 617.000000 15525.000000
In [162]:
#covid_india['State/UnionTerritory'].value_counts()
In [163]:
covid_india.duplicated().sum()
Out[163]:
0
In [164]:
covid_india.isna().sum()
Out[164]:
Sno                         0
Date                        0
Time                        0
State/UnionTerritory        0
ConfirmedIndianNational     0
ConfirmedForeignNational    0
Cured                       0
Deaths                      0
Confirmed                   0
dtype: int64
In [165]:
covid_india.Deaths.max()
Out[165]:
617
In [166]:
covid_india['State/UnionTerritory'].unique()
Out[166]:
array(['Kerala', 'Telengana', 'Delhi', 'Rajasthan', 'Uttar Pradesh',
       'Haryana', 'Ladakh', 'Tamil Nadu', 'Karnataka', 'Maharashtra',
       'Punjab', 'Jammu and Kashmir', 'Andhra Pradesh', 'Uttarakhand',
       'Odisha', 'Puducherry', 'West Bengal', 'Chhattisgarh',
       'Chandigarh', 'Gujarat', 'Himachal Pradesh', 'Madhya Pradesh',
       'Bihar', 'Manipur', 'Mizoram', 'Andaman and Nicobar Islands',
       'Goa', 'Unassigned', 'Assam', 'Jharkhand', 'Arunachal Pradesh',
       'Tripura', 'Nagaland', 'Meghalaya', 'Nagaland#', 'Jharkhand#',
       'Dadar Nagar Haveli'], dtype=object)
In [167]:
covid_india_state_testing.sum()
Out[167]:
Date            2020-04-172020-04-242020-04-272020-05-012020-0...
State           Andaman and Nicobar IslandsAndaman and Nicobar...
TotalSamples                                          1.57851e+07
Negative                                              1.30916e+07
Positive                                                   571595
dtype: object
In [168]:
covid_india_state_testing.duplicated().sum()
Out[168]:
2
In [169]:
covid_india_state_testing.isna().sum()
Out[169]:
Date              0
State             0
TotalSamples      0
Negative        136
Positive          9
dtype: int64
In [170]:
covid_india_state_testing.describe()
Out[170]:
TotalSamples Negative Positive
count 759.000000 623.000000 750.000000
mean 20797.184453 21013.754414 762.126667
std 29959.945068 30047.832110 1542.251749
min 58.000000 0.000000 0.000000
25% 2392.500000 2492.500000 33.000000
50% 8612.000000 8310.000000 207.000000
75% 24910.000000 24462.500000 737.500000
max 175323.000000 162349.000000 14541.000000

** Assessing the Covid_19 time series data for all countries.

In [171]:
covid_time_series_C.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 266 entries, 0 to 265
Columns: 110 entries, Province/State to 5/6/20
dtypes: float64(2), int64(106), object(2)
memory usage: 228.7+ KB
In [172]:
covid_time_series_covid_19_R.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 252 entries, 0 to 251
Columns: 110 entries, Province/State to 5/6/20
dtypes: float64(2), int64(106), object(2)
memory usage: 216.7+ KB
In [173]:
covid_time_series_D.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 266 entries, 0 to 265
Columns: 110 entries, Province/State to 5/6/20
dtypes: float64(2), int64(106), object(2)
memory usage: 228.7+ KB
In [174]:
covid_time_series_C.isna().sum()
Out[174]:
Province/State    184
Country/Region      0
Lat                 0
Long                0
1/22/20             0
                 ... 
5/2/20              0
5/3/20              0
5/4/20              0
5/5/20              0
5/6/20              0
Length: 110, dtype: int64
In [175]:
covid_time_series_covid_19_R.isna().sum()
Out[175]:
Province/State    185
Country/Region      0
Lat                 0
Long                0
1/22/20             0
                 ... 
5/2/20              0
5/3/20              0
5/4/20              0
5/5/20              0
5/6/20              0
Length: 110, dtype: int64
In [176]:
covid_time_series_D.isna().sum()
Out[176]:
Province/State    184
Country/Region      0
Lat                 0
Long                0
1/22/20             0
                 ... 
5/2/20              0
5/3/20              0
5/4/20              0
5/5/20              0
5/6/20              0
Length: 110, dtype: int64
In [177]:
covid_time_series_C.describe()
Out[177]:
Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 1/28/20 1/29/20 ... 4/27/20 4/28/20 4/29/20 4/30/20 5/1/20 5/2/20 5/3/20 5/4/20 5/5/20 5/6/20
count 266.000000 266.000000 266.000000 266.000000 266.000000 266.000000 266.000000 266.000000 266.000000 266.000000 ... 266.000000 2.660000e+02 2.660000e+02 2.660000e+02 2.660000e+02 2.660000e+02 2.660000e+02 2.660000e+02 2.660000e+02 2.660000e+02
mean 21.259359 22.432499 2.086466 2.458647 3.537594 5.390977 7.962406 11.003759 20.969925 23.180451 ... 11367.375940 1.164357e+04 1.192589e+04 1.224381e+04 1.257059e+04 1.288475e+04 1.318319e+04 1.347013e+04 1.376952e+04 1.411782e+04
std 24.747943 70.478908 27.279200 27.377862 34.083035 47.434934 66.289178 89.313757 219.187744 220.524977 ... 65963.451143 6.750722e+04 6.918850e+04 7.102979e+04 7.311436e+04 7.495796e+04 7.657327e+04 7.801568e+04 7.957189e+04 8.121537e+04
min -51.796300 -135.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
25% 6.907750 -18.093125 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 74.250000 7.500000e+01 7.600000e+01 7.725000e+01 8.100000e+01 8.200000e+01 8.225000e+01 8.675000e+01 9.350000e+01 9.525000e+01
50% 23.488100 20.921188 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 423.500000 4.335000e+02 4.555000e+02 4.665000e+02 4.855000e+02 4.995000e+02 5.085000e+02 5.425000e+02 5.480000e+02 5.560000e+02
75% 41.143200 77.191525 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 1974.500000 2.020000e+03 2.047250e+03 2.105250e+03 2.165750e+03 2.359250e+03 2.515000e+03 2.635750e+03 2.699750e+03 2.872750e+03
max 71.706900 178.065000 444.000000 444.000000 549.000000 761.000000 1058.000000 1423.000000 3554.000000 3554.000000 ... 988197.000000 1.012582e+06 1.039909e+06 1.069424e+06 1.103461e+06 1.132539e+06 1.158040e+06 1.180375e+06 1.204351e+06 1.228603e+06

8 rows × 108 columns

In [178]:
covid_time_series_covid_19_R.describe()
Out[178]:
Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 1/28/20 1/29/20 ... 4/27/20 4/28/20 4/29/20 4/30/20 5/1/20 5/2/20 5/3/20 5/4/20 5/5/20 5/6/20
count 252.000000 252.000000 252.000000 252.000000 252.000000 252.000000 252.000000 252.000000 252.000000 252.000000 ... 252.000000 252.000000 252.000000 252.000000 252.000000 252.000000 252.000000 252.000000 252.000000 252.000000
mean 19.997457 28.167963 0.111111 0.119048 0.142857 0.154762 0.206349 0.242063 0.424603 0.500000 ... 3466.972222 3598.980159 3763.591270 4023.297619 4176.250000 4337.746032 4465.222222 4613.984127 4757.269841 4942.115079
std 24.408240 67.225277 1.763834 1.767827 1.958592 2.024712 2.654973 2.858017 5.069858 5.577059 ... 14409.367219 14829.617179 15399.747262 16859.803967 17468.700495 18210.920341 18611.712422 19134.155254 19528.619764 20013.875983
min -51.796300 -106.346800 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 6.565350 -7.825200 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 18.000000 19.000000 19.750000 24.750000 25.750000 25.750000 27.000000 28.500000 30.000000 30.000000
50% 21.805100 23.409400 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 170.000000 182.000000 182.000000 198.000000 209.500000 213.000000 226.000000 232.000000 242.000000 256.500000
75% 39.329025 85.953175 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 927.500000 977.500000 1007.500000 1022.000000 1083.250000 1156.500000 1224.000000 1275.750000 1317.250000 1342.500000
max 71.706900 178.065000 28.000000 28.000000 31.000000 32.000000 42.000000 45.000000 80.000000 88.000000 ... 114500.000000 117400.000000 120720.000000 153947.000000 164015.000000 175382.000000 180152.000000 187180.000000 189791.000000 189910.000000

8 rows × 108 columns

In [179]:
covid_time_series_D.describe()
Out[179]:
Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 1/28/20 1/29/20 ... 4/27/20 4/28/20 4/29/20 4/30/20 5/1/20 5/2/20 5/3/20 5/4/20 5/5/20 5/6/20
count 266.000000 266.000000 266.000000 266.000000 266.000000 266.000000 266.000000 266.000000 266.000000 266.000000 ... 266.000000 266.000000 266.000000 266.000000 266.000000 266.000000 266.000000 266.000000 266.000000 266.000000
mean 21.259359 22.432499 0.063910 0.067669 0.097744 0.157895 0.210526 0.308271 0.492481 0.500000 ... 806.180451 830.071429 855.883459 877.281955 897.063910 916.571429 930.338346 945.627820 967.063910 991.845865
std 24.747943 70.478908 1.042337 1.043908 1.473615 2.453621 3.189730 4.660845 7.664297 7.664793 ... 4605.196763 4744.311185 4906.661988 5033.525489 5151.692272 5255.657646 5334.737320 5413.828455 5546.438359 5694.709721
min -51.796300 -135.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 6.907750 -18.093125 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
50% 23.488100 20.921188 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 7.000000 7.000000 7.000000 8.000000 8.000000 8.000000 9.000000 9.000000 9.000000 9.000000
75% 41.143200 77.191525 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 56.750000 58.000000 59.750000 61.000000 67.500000 70.500000 75.250000 78.000000 79.750000 85.750000
max 71.706900 178.065000 17.000000 17.000000 24.000000 40.000000 52.000000 76.000000 125.000000 125.000000 ... 56219.000000 58355.000000 60967.000000 62996.000000 64943.000000 66369.000000 67682.000000 68922.000000 71064.000000 73431.000000

8 rows × 108 columns

Data Issue 5:

*** Drop the Sno column from the covid_india_age dataframe.

In [180]:
covid_india_age = covid_india_age.drop(['Sno'], axis=1)
In [181]:
covid_india_age
Out[181]:
AgeGroup TotalCases Percentage
0 0-9 22 3.18%
1 10-19 27 3.90%
2 20-29 172 24.86%
3 30-39 146 21.10%
4 40-49 112 16.18%
5 50-59 77 11.13%
6 60-69 89 12.86%
7 70-79 28 4.05%
8 >=80 10 1.45%
9 Missing 9 1.30%

*** In the covid_india csv file, the 'State/UnionTerritory' column contains the value for Nagaland twice. This issue is neglected for the analysis.

Data Issue 6:

*** Missing values (NaN) in the three covid_time_series dataframes: Confirmed, Recovered and Deaths.

In [182]:
covid_time_series_covid_19_R.fillna(0, inplace=True)
In [183]:
covid_time_series_C.fillna(0,inplace = True)
In [184]:
covid_time_series_D.fillna(0,inplace =True)

Data Issue 7:

*** Drop the [Lat, Long] columns from the Covid time_series dataframes; they are irrelevant to this analysis.

In [185]:
#covid_time_series_covid_19_R = covid_time_series_covid_19_R.drop(['Province/State','Lat','Long'], axis=1)
covid_time_series_covid_19_R = covid_time_series_covid_19_R.drop(['Lat','Long'], axis=1)

covid_time_series_covid_19_R 
Out[185]:
Province/State Country/Region 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 1/28/20 1/29/20 ... 4/27/20 4/28/20 4/29/20 4/30/20 5/1/20 5/2/20 5/3/20 5/4/20 5/5/20 5/6/20
0 0 Afghanistan 0 0 0 0 0 0 0 0 ... 220 228 252 260 310 331 345 397 421 458
1 0 Albania 0 0 0 0 0 0 0 0 ... 422 431 455 470 488 519 531 543 570 595
2 0 Algeria 0 0 0 0 0 0 0 0 ... 1558 1651 1702 1779 1821 1872 1936 1998 2067 2197
3 0 Andorra 0 0 0 0 0 0 0 0 ... 385 398 423 468 468 472 493 499 514 521
4 0 Angola 0 0 0 0 0 0 0 0 ... 6 6 7 7 11 11 11 11 11 11
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
247 0 Western Sahara 0 0 0 0 0 0 0 0 ... 5 5 5 5 5 5 5 5 5 5
248 0 Sao Tome and Principe 0 0 0 0 0 0 0 0 ... 0 4 4 4 4 4 4 4 4 4
249 0 Yemen 0 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1 1 1
250 0 Comoros 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
251 0 Tajikistan 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

252 rows × 108 columns

In [186]:
covid_time_series_C =covid_time_series_C.drop(['Lat','Long'], axis=1)

#covid_time_series_C =covid_time_series_C.drop(['Province/State','Lat','Long'], axis=1)
#covid_time_series_C
In [187]:
covid_time_series_D = covid_time_series_D.drop(['Lat','Long'], axis=1)
#covid_time_series_D = covid_time_series_D.drop(['Province/State','Lat','Long'], axis=1)
#covid_time_series_D

Data Issue 8:

** The time_series data is difficult to analyze in its wide format.

The time series data stores each day's values in a separate date column, which is not convenient for analysis. First we unpivot the date columns (columns[3:]) into a variable column 'Date' and a value column 'Confirmed', 'Recovered' or 'Deaths'.
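As a minimal sketch of the unpivot step described above, the toy frame below (made-up values, not the real dataset) is reshaped from one column per date to one row per country and date. Selecting the date columns by excluding the id columns, rather than by position, is an assumption of this sketch; it avoids accidentally skipping the first date column.

```python
import pandas as pd

# Toy wide-format table: one column per date (values are made up for illustration).
wide = pd.DataFrame({
    "Province/State": [0, 0],
    "Country/Region": ["A", "B"],
    "1/22/20": [0, 1],
    "1/23/20": [2, 3],
})

id_cols = ["Country/Region", "Province/State"]
# Every column that is not an id column is treated as a date column.
date_cols = [c for c in wide.columns if c not in id_cols]

long = wide.melt(id_vars=id_cols, value_vars=date_cols,
                 var_name="Date", value_name="Confirmed")
print(long)
```

Each country now contributes one row per date, which is the shape the later groupby and merge steps work on.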

In [188]:
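# Note: after dropping Lat/Long the id columns occupy positions 0-1, so [3:] starts at 1/23/20 and skips 1/22/20.
dates = covid_time_series_C.columns[3:]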
covid_time_series_C = covid_time_series_C.melt(
    id_vars=['Country/Region','Province/State'], 
    value_vars=dates, 
    var_name='Date', 
    value_name='Confirmed'

)

covid_time_series_C
Out[188]:
Country/Region Province/State Date Confirmed
0 Afghanistan 0 1/23/20 0
1 Albania 0 1/23/20 0
2 Algeria 0 1/23/20 0
3 Andorra 0 1/23/20 0
4 Angola 0 1/23/20 0
... ... ... ... ...
27925 Western Sahara 0 5/6/20 6
27926 Sao Tome and Principe 0 5/6/20 174
27927 Yemen 0 5/6/20 25
27928 Comoros 0 5/6/20 8
27929 Tajikistan 0 5/6/20 379

27930 rows × 4 columns

In [189]:
covid_time_series_C = covid_time_series_C.groupby(['Country/Region', 'Date'], as_index=False)['Confirmed'].sum()
In [190]:
dates = covid_time_series_D.columns[3:]
covid_time_series_D = covid_time_series_D.melt(
   id_vars=['Country/Region','Province/State'],
    value_vars=dates, 
    var_name='Date', 
    value_name='Deaths'
)


covid_time_series_D = covid_time_series_D.groupby(['Country/Region', 'Date'], as_index=False)['Deaths'].sum()
covid_time_series_D
Out[190]:
Country/Region Date Deaths
0 Afghanistan 1/23/20 0
1 Afghanistan 1/24/20 0
2 Afghanistan 1/25/20 0
3 Afghanistan 1/26/20 0
4 Afghanistan 1/27/20 0
... ... ... ...
19630 Zimbabwe 5/2/20 4
19631 Zimbabwe 5/3/20 4
19632 Zimbabwe 5/4/20 4
19633 Zimbabwe 5/5/20 4
19634 Zimbabwe 5/6/20 4

19635 rows × 3 columns

In [191]:
dates = covid_time_series_covid_19_R.columns[3:]
covid_time_series_covid_19_R= covid_time_series_covid_19_R.melt(
    id_vars=['Country/Region','Province/State'],
    value_vars=dates, 
    var_name='Date', 
    value_name='Recovered'
    
)

Data Issue 9:

  • The world data also contains provincial data for several countries. The data is aggregated per country and date using groupby.

  • By default, groupby returns a result indexed by the grouping keys. Passing as_index=False keeps Country/Region and Date as regular columns, so the result is an ordinary dataframe.
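The aggregation described above can be sketched on a toy long-format frame. The values and the province names "P1" and "P2" are made up for illustration; only the column names follow the notebook.

```python
import pandas as pd

# Toy long-format data: country "A" is reported as two provinces (made-up values).
rows = pd.DataFrame({
    "Country/Region": ["A", "A", "B"],
    "Province/State": ["P1", "P2", 0],
    "Date": ["1/23/20", "1/23/20", "1/23/20"],
    "Confirmed": [5, 7, 3],
})

# as_index=False keeps the group keys as ordinary columns of a DataFrame.
by_country = rows.groupby(["Country/Region", "Date"], as_index=False)["Confirmed"].sum()
print(by_country)
```

The two provincial rows of country "A" collapse into a single row whose Confirmed value is their sum.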

In [192]:
covid_time_series_covid_19_R = covid_time_series_covid_19_R.groupby(['Country/Region', 'Date'], as_index=False)['Recovered'].sum()
covid_time_series_covid_19_R
Out[192]:
Country/Region Date Recovered
0 Afghanistan 1/23/20 0
1 Afghanistan 1/24/20 0
2 Afghanistan 1/25/20 0
3 Afghanistan 1/26/20 0
4 Afghanistan 1/27/20 0
... ... ... ...
19630 Zimbabwe 5/2/20 5
19631 Zimbabwe 5/3/20 5
19632 Zimbabwe 5/4/20 5
19633 Zimbabwe 5/5/20 5
19634 Zimbabwe 5/6/20 5

19635 rows × 3 columns

Data Issue 10:

* Merge covid_time_series_C, covid_time_series_D and covid_time_series_covid_19_R using the merge function.

* Calculate Death_percentage and Recovered_percentage in the covid_time_series dataframe.
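A minimal sketch of the merge and percentage step on toy data follows. The column name `Death_pct` is hypothetical, and this sketch multiplies the ratio by 100 to express it as a percentage and guards against a zero denominator; both choices are assumptions and differ from the notebook's own cell.

```python
import numpy as np
import pandas as pd

# Toy frames sharing the merge keys (values are made up for illustration).
confirmed = pd.DataFrame({"Country/Region": ["A", "A"],
                          "Date": ["1/23/20", "1/24/20"],
                          "Confirmed": [0, 34]})
deaths = pd.DataFrame({"Country/Region": ["A", "A"],
                       "Date": ["1/23/20", "1/24/20"],
                       "Deaths": [0, 4]})

merged = confirmed.merge(deaths, how="left", on=["Country/Region", "Date"])

# Replace a zero denominator with NaN so 0/0 does not occur, then fill with 0.
denom = merged["Confirmed"].replace(0, np.nan)
merged["Death_pct"] = (merged["Deaths"] / denom * 100).fillna(0)
print(merged)
```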

In [193]:
covid_time_series= covid_time_series_C.merge(right=covid_time_series_D, how='left',on=['Country/Region', 'Date'])
covid_time_series = covid_time_series.merge( right=covid_time_series_covid_19_R, how='left',on=['Country/Region', 'Date'])
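# Note: these divide the ratio by 100 (e.g. 4/34/100 = 0.001176); a true percentage would multiply by 100 instead.
covid_time_series['Death_percentage'] = (covid_time_series.Deaths / covid_time_series.Confirmed)/100
covid_time_series['Recovered_percentage'] = (covid_time_series.Recovered / covid_time_series.Confirmed)/100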
covid_time_series.duplicated().sum()
covid_time_series = covid_time_series.drop_duplicates() 

Data Issue 11:

Cases started appearing in different countries at different times, so comparing countries on a particular calendar date is not a good comparison when we want to analyse the trend of the outbreak. I calculated the date of the first corona positive case in each country and stored it in a dataframe. Using this date, I assigned to each row the number of days passed since the first corona case. Now we can compare different countries based on days passed since the first case.
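The idea above can also be sketched without row-by-row loops. This is a vectorized variant on toy data, not the notebook's method: it assumes the first case is the first date with Confirmed > 0 (the notebook's cell tests for exactly 1), and the column name `FirstCase` is hypothetical.

```python
import pandas as pd

# Toy cumulative counts for two countries (values are made up for illustration).
ts = pd.DataFrame({
    "Country/Region": ["A", "A", "A", "B", "B", "B"],
    "Date": ["1/23/20", "1/24/20", "1/25/20", "1/23/20", "1/24/20", "1/25/20"],
    "Confirmed": [0, 3, 5, 0, 0, 1],
})
ts["Date"] = pd.to_datetime(ts["Date"], format="%m/%d/%y")

# First date with at least one confirmed case, per country.
first_case = (ts[ts["Confirmed"] > 0]
              .groupby("Country/Region")["Date"].min()
              .rename("FirstCase")
              .reset_index())
ts = ts.merge(first_case, on="Country/Region", how="left")

# Days elapsed since the first case; dates before it are clipped to 0.
ts["Day"] = (ts["Date"] - ts["FirstCase"]).dt.days.clip(lower=0)
print(ts[["Country/Region", "Date", "Confirmed", "Day"]])
```

Parsing the dates first also makes any sorting chronological rather than alphabetical.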

In [194]:
#covid_time_seriesChina = covid_time_series[covid_time_series['Country/Region'] == 'China']
#covid_time_seriesChina.head(50)
covid_time_series
Out[194]:
Country/Region Date Confirmed Deaths Recovered Death_percentage Recovered_percentage
0 Afghanistan 1/23/20 0 0 0 NaN NaN
1 Afghanistan 1/24/20 0 0 0 NaN NaN
2 Afghanistan 1/25/20 0 0 0 NaN NaN
3 Afghanistan 1/26/20 0 0 0 NaN NaN
4 Afghanistan 1/27/20 0 0 0 NaN NaN
... ... ... ... ... ... ... ...
19630 Zimbabwe 5/2/20 34 4 5 0.001176 0.001471
19631 Zimbabwe 5/3/20 34 4 5 0.001176 0.001471
19632 Zimbabwe 5/4/20 34 4 5 0.001176 0.001471
19633 Zimbabwe 5/5/20 34 4 5 0.001176 0.001471
19634 Zimbabwe 5/6/20 34 4 5 0.001176 0.001471

19635 rows × 7 columns

In [195]:
def getfirst_iterrows_loop(df):
    # Return the first date on which the country's Confirmed count is exactly 1.
    # Countries whose count jumps from 0 straight past 1 return None.
    for index, row in df.iterrows():
        if (row['Confirmed'] == 1):
            return row['Date']
    return None

# Select the columns with a list (double brackets) to avoid the pandas FutureWarning.
df3=covid_time_series.groupby(['Country/Region'])[['Country/Region','Confirmed','Date']].apply(getfirst_iterrows_loop).reset_index()
df3.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 187 entries, 0 to 186
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Country/Region  187 non-null    object
 1   0               126 non-null    object
dtypes: object(2)
memory usage: 3.0+ KB
In [196]:
df3.rename(columns={ df3.columns[1]: "Day" }, inplace = True)
In [197]:
from datetime import datetime
def setDay(record):
    # Days between this row's date and its country's first-case date.
    # Returns 0 when no first-case date is known or the row predates it.
    days = 0
    for index, row in df3.iterrows():
        if (row['Country/Region'] == record['Country/Region'] and row['Day'] is not None):
            delta = datetime.strptime(record['Date'], '%m/%d/%y').date() - datetime.strptime(row['Day'], '%m/%d/%y').date()
            days = delta.days
    return 0 if days < 0 else days
In [198]:
covid_time_series.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 19635 entries, 0 to 19634
Data columns (total 7 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Country/Region        19635 non-null  object 
 1   Date                  19635 non-null  object 
 2   Confirmed             19635 non-null  int64  
 3   Deaths                19635 non-null  int64  
 4   Recovered             19635 non-null  int64  
 5   Death_percentage      11871 non-null  float64
 6   Recovered_percentage  11871 non-null  float64
dtypes: float64(2), int64(3), object(2)
memory usage: 1.2+ MB
In [199]:
covid_time_series['Day'] =covid_time_series.apply(lambda row: setDay(row), axis=1)
covid_time_series['Day']
Out[199]:
0         0
1         0
2         0
3         0
4         0
         ..
19630    43
19631    44
19632    45
19633    46
19634    47
Name: Day, Length: 19635, dtype: int64
In [200]:
#covid_time_series.head(50)

Data Issue 12:

*** Replace NaN values with zero in Death_percentage and Recovered_percentage in covid_time_series.

In [201]:
covid_time_series.fillna(0, inplace=True)
In [202]:
#covid_time_series['Date'] = pd.to_datetime(covid_time_series['Date'], errors='coerce')
#covid_time_series['Day'] = covid_time_series['Date'].dt.day
#covid_time_series['Day']

*** Store the covid_time_series data in a new csv file:

In [203]:
covid_time_series.to_csv('covid_time_series1.csv')
In [204]:
covid_time_series_I = covid_time_series[covid_time_series['Country/Region']=='India']

covid_time_series_I.head(5)
Out[204]:
Country/Region Date Confirmed Deaths Recovered Death_percentage Recovered_percentage Day
8295 India 1/23/20 0 0 0 0.0 0.0 0
8296 India 1/24/20 0 0 0 0.0 0.0 0
8297 India 1/25/20 0 0 0 0.0 0.0 0
8298 India 1/26/20 0 0 0 0.0 0.0 0
8299 India 1/27/20 0 0 0 0.0 0.0 0

Exploratory data analysis (EDA)

Covid_19 All countries Data:

Mortality rates and recovered cases of covid 19.

In [205]:
fig = px.scatter(full_data,x="total_cases",y="total_deaths",color='country',log_x=True ,log_y=True ,size_max=100, range_x=[1,1000000000],range_y=[1,1000000])
fig.update_traces(textposition='top center')
fig.update_layout(
   # height=800,width=1000,
    title_text='Total Deaths Cases in the world',xaxis = dict(
        tickangle = 90,
        title_text = "Total_cases",
        title_font = {"size": 15},
        title_standoff = 10),
    yaxis = dict(
        title_text = "Total_Deaths Cases",
        title_standoff = 10)
)
fig.show()

- *** This plot shows country-wise deaths vs total confirmed cases on a logarithmic scale.

  • In many countries the total deaths reported are below 10,000.

As we can see in the graph, the top five countries on the basis of total confirmed cases, total deaths and recovered cases are

  • USA,
  • Spain ,
  • Italy,
  • UK,
  • Russia
In [206]:
fig = px.scatter(full_data,x="total_cases",y="total_recovered",color='country',  log_x=True ,log_y=True ,size_max=100, range_x=[1,10000000],range_y=[1,1000000])
fig.update_traces(textposition='top center')
fig.update_layout(
   # height=800,width=1000,
    title_text='Total Recovered Cases in the world',
    xaxis = dict(
        tickangle = 90,
        title_text = "Total_cases",
        title_font = {"size": 15},
        title_standoff = 10),
    yaxis = dict(
        title_text = "Total_recovered Cases",
        title_standoff = 10)
)
fig.show()

  • The plot above shows country-wise recovered cases on a logarithmic scale.
  • All countries are included in the analysis, so points overlap in the scatter plot.
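One way to reduce the overlap noted above is to plot only the countries with the most cases. The sketch below uses a toy stand-in for full_data with made-up values; the cutoff of 2 is arbitrary.

```python
import pandas as pd

# Hypothetical stand-in for full_data with only the columns used here.
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "total_cases": [100, 5000, 20, 900],
    "total_recovered": [50, 1000, 10, 300],
})

# Keep the n countries with the most cases to reduce overplotting.
top = toy.nlargest(2, "total_cases")
print(top)
```

The filtered frame can then be passed to the same px.scatter call in place of the full data.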

Analysis of Covid_19 Cases in all countries with Statsmodel:

- The Statsmodels module is used to analyse the Covid-19 data of all countries. It provides classes and functions for estimating regression models, conducting statistical tests and exploring data statistically. Here it is applied to 'total_cases', 'total_deaths', 'new_cases', 'new_deaths', 'total_recovered' and 'active_cases' in all countries.

  • Assumptions of a regression model:

    1. Relevance of data to the covid_19 cases.
    2. Linearity and additivity
    3. Independence of residuals
    4. Constancy of variance of residuals
    5. Normal distribution of residuals
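Two of the assumptions listed above, independence and normality of residuals, can be checked numerically. The sketch below uses synthetic data and plain NumPy rather than the notebook's data. It computes the Durbin-Watson statistic and the residual skewness, which are also reported in a statsmodels summary.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=200)  # synthetic data

# Ordinary least squares with an intercept column.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Durbin-Watson statistic: values near 2 suggest uncorrelated residuals.
dw = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

# Sample skewness of the residuals: values near 0 suggest symmetry.
skew = np.mean((resid - resid.mean()) ** 3) / resid.std() ** 3

print(beta, dw, skew)
```

Strongly skewed residuals or a Durbin-Watson value far from 2 would suggest that the corresponding assumption does not hold.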
In [207]:
X = full_data[['total_deaths','new_cases','new_deaths']] 
# Fit an OLS model with an intercept: total_cases on total_deaths, new_cases and new_deaths.
y = full_data['total_cases']
X = sm.add_constant(X)
est = sm.OLS(y, X).fit()
print(est.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:            total_cases   R-squared:                       0.963
Model:                            OLS   Adj. R-squared:                  0.962
Method:                 Least Squares   F-statistic:                     1822.
Date:                Sun, 10 May 2020   Prob (F-statistic):          5.23e-150
Time:                        02:17:33   Log-Likelihood:                -2411.3
No. Observations:                 214   AIC:                             4831.
Df Residuals:                     210   BIC:                             4844.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const         -249.7789   1346.307     -0.186      0.853   -2903.787    2404.229
total_deaths     5.7334      0.563     10.186      0.000       4.624       6.843
new_cases       19.1134      1.938      9.862      0.000      15.293      22.934
new_deaths     356.5852     58.874      6.057      0.000     240.526     472.644
==============================================================================
Omnibus:                      167.889   Durbin-Watson:                   1.552
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            10347.060
Skew:                          -2.313   Prob(JB):                         0.00
Kurtosis:                      36.749   Cond. No.                     7.21e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.21e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Regression Analysis:

*β₀, β₁, β₂ and β₃ are chosen to minimize the sum of squared distances between the predicted values and the actual values.

Adj. R-squared indicates that about 96% of the variance in total_cases is explained by the predictor variables total_deaths, new_cases and new_deaths.

To understand the trends, we look at the slopes of total_deaths, new_cases and new_deaths on a linear scale.

From our results, we see that • The intercept 𝛽̂0 = -249.7789 (p = 0.853, not statistically significant).

The regression coefficient (coef) represents the change in the dependent variable resulting from a one unit change in the predictor variable, all other variables being held constant. In our model, a one unit increase in total_deaths, new_cases or new_deaths is associated with an increase in total_cases.

• The slope 𝛽̂1 = 5.7334 (total_deaths).

• The slope 𝛽̂2 = 19.1134 (new_cases).

• The slope 𝛽̂3 = 356.5852 (new_deaths).

• All three slope estimates are positive. In line with our assumptions, higher total_deaths, new_cases and new_deaths are associated with higher total cases.

The p-value is the probability of observing a coefficient estimate at least this far from zero if there were no relationship between the predictor and total_cases.

• The p-values of total_deaths, new_cases and new_deaths are all 0.000, so 𝛽̂1, 𝛽̂2 and 𝛽̂3 are statistically significant (using p < 0.05 as a rejection rule).

** The standard error measures the precision of a coefficient by estimating how much it would vary if the same model were fitted on a different sample. The standard error of the total_deaths coefficient, 0.563, is small relative to the estimate of 5.7334.

• The R-squared value is 0.963. The summary also warns of a large condition number (7.21e+03), which may indicate strong multicollinearity among the predictors, so the individual coefficients should be interpreted with caution.
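To make the R-squared and Adj. R-squared figures discussed above concrete, the sketch below computes both by hand for a least-squares fit with three predictors. The data is synthetic and the true coefficients (5, 2, -1) are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
X_raw = rng.normal(size=(n, 3))            # three synthetic predictors
y = 1.0 + X_raw @ np.array([5.0, 2.0, -1.0]) + rng.normal(size=n)

X = np.column_stack([np.ones(n), X_raw])   # add the intercept column
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta

ss_res = np.sum((y - fitted) ** 2)         # unexplained variation
ss_tot = np.sum((y - y.mean()) ** 2)       # total variation
r2 = 1 - ss_res / ss_tot

k = X_raw.shape[1]                         # number of predictors
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(round(r2, 3), round(adj_r2, 3))
```

R-squared is the share of the variation in the dependent variable that the fitted values account for. The adjusted version penalizes each additional predictor.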

In [208]:
fig = plt.figure(figsize =(15,8))
results = smf.ols('total_cases~total_deaths+new_cases+new_deaths',data = full_data).fit()
sm.graphics.plot_regress_exog(results, 'total_deaths', fig=fig)
plt.show()

Regression Plot Analysis:

  1. The "Y and Fitted vs. X" graph plots the dependent variable against our predicted values with a confidence interval. The upward trend in our graph indicates that total_cases and total_deaths are positively correlated, i.e., when one variable increases the other also increases.

  2. The "Residuals versus total_deaths" graph shows our model's errors versus the specified predictor variable. Each dot is the residual of one observation; the horizontal line marks zero, the mean of the residuals.

  3. The "Partial regression plot" shows the relationship between total_cases and total_deaths after taking into account the impact of the other independent variables on our total_deaths coefficient.

  4. The Component-Component plus Residual (CCPR) plot is an extension of the partial regression plot. It shows where our trend line would lie after adding the impact of the other independent variables on our existing total_deaths coefficient. This is the "component" part of the plot and is intended to show where the "fitted line" would lie.
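The partial regression plot in item 3 can be illustrated numerically. In this synthetic sketch (variable names x1 and x2 are hypothetical), y and x1 are each regressed on the remaining regressors. The slope between the two sets of residuals matches the x1 coefficient of the full model, and that slope is the line a partial regression plot draws.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)          # correlated with x1
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

def ols(X, y):
    """Return least-squares coefficients for design matrix X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

ones = np.ones(n)
full = ols(np.column_stack([ones, x1, x2]), y)

# Residualize y and x1 on the other regressors (intercept and x2).
Z = np.column_stack([ones, x2])
y_res = y - Z @ ols(Z, y)
x1_res = x1 - Z @ ols(Z, x1)

# Slope of the residual-on-residual fit equals the x1 coefficient of the full model.
partial_slope = (x1_res @ y_res) / (x1_res @ x1_res)
print(full[1], partial_slope)
```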

In [209]:
X = full_data[['total_recovered','new_cases','active_cases']] 
# Fit an OLS model with an intercept: total_cases on total_recovered, new_cases and active_cases.
y = full_data['total_cases']
X = sm.add_constant(X)
est = sm.OLS(y, X).fit()
print(est.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:            total_cases   R-squared:                       0.999
Model:                            OLS   Adj. R-squared:                  0.999
Method:                 Least Squares   F-statistic:                 1.297e+05
Date:                Sun, 10 May 2020   Prob (F-statistic):               0.00
Time:                        02:17:35   Log-Likelihood:                -1958.8
No. Observations:                 214   AIC:                             3926.
Df Residuals:                     210   BIC:                             3939.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
===================================================================================
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
const             148.9922    164.783      0.904      0.367    -175.849     473.834
total_recovered     1.0992      0.009    128.820      0.000       1.082       1.116
new_cases          -1.4876      0.259     -5.752      0.000      -1.997      -0.978
active_cases        1.0893      0.006    188.678      0.000       1.078       1.101
==============================================================================
Omnibus:                      276.426   Durbin-Watson:                   1.539
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            19360.057
Skew:                           5.492   Prob(JB):                         0.00
Kurtosis:                      48.283   Cond. No.                     7.94e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.94e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

Regression Analysis:

*β₀, β₁, β₂ and β₃ are chosen to minimize the sum of squared distances between the predicted values and the actual values.

Adj. R-squared indicates that about 99.9% of the variance in total_cases is explained by the predictor variables total_recovered, new_cases and active_cases. Such a near-perfect fit is expected if active_cases is itself derived from total cases, deaths and recoveries.

To understand the trends, we look at the slopes of total_recovered, new_cases and active_cases on a linear scale.

From our results, we see that • The intercept 𝛽̂0 = 148.9922 (p = 0.367, not statistically significant).

The regression coefficient (coef) represents the change in the dependent variable resulting from a one unit change in the predictor variable, all other variables being held constant. In our model, a one unit increase in total_recovered or active_cases is associated with an increase in total_cases, while new_cases has a negative coefficient.

• The slope 𝛽̂1 = 1.0992 (total_recovered).

• The slope 𝛽̂2 = -1.4876 (new_cases).

• The slope 𝛽̂3 = 1.0893 (active_cases).

• The negative 𝛽̂2 estimate means that, with total_recovered and active_cases held constant, higher new_cases is associated with slightly lower total_cases. In line with our assumptions, an increase in total_recovered or active_cases is associated with an increase in total cases.

The p-value is the probability of observing a coefficient estimate at least this far from zero if there were no relationship between the predictor and total_cases.

• The p-values of total_recovered, new_cases and active_cases are all 0.000, so 𝛽̂1, 𝛽̂2 and 𝛽̂3 are statistically significant (using p < 0.05 as a rejection rule).

** The standard error measures the precision of a coefficient by estimating how much it would vary if the same model were fitted on a different sample. The standard error of the total_recovered coefficient, 0.009, is small relative to the estimate of 1.0992.

• The R-squared value is 0.999.
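The summary output for this model warns about a large condition number. As a rough illustration on synthetic data (not the notebook's data), the sketch below compares the condition number of a design matrix containing two nearly identical predictors with one whose predictors are unrelated.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
a = rng.normal(size=n)
b = a + 0.01 * rng.normal(size=n)   # nearly a copy of a
c = rng.normal(size=n)              # unrelated predictor

X_collinear = np.column_stack([np.ones(n), a, b])
X_independent = np.column_stack([np.ones(n), a, c])

# Ratio of largest to smallest singular value of each design matrix.
print(np.linalg.cond(X_collinear), np.linalg.cond(X_independent))
```

A much larger value for the first matrix reflects that its columns carry almost the same information, which makes individual coefficients unstable.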

In [210]:
fig = plt.figure(figsize =(15,8))
results = smf.ols('total_cases~total_recovered  + new_cases+active_cases', data = full_data).fit()
#sm.graphics.plot_ccpr_grid(results, fig=fig)
sm.graphics.plot_regress_exog(results, 'total_recovered', fig=fig)
plt.show()
###endogenous: caused by factors within the system ,exogenous: caused by factors outside the system

Regression plot Analysis:

  1. The "Y and Fitted vs. X" graph plots the dependent variable against our predicted values with a confidence interval. The upward trend in our graph indicates that total_cases and total_recovered cases are positively correlated, i.e., when one variable increases the other also increases.

  2. The "Residuals versus total_recovered" graph shows our model's errors versus the specified predictor variable. Each dot is the residual of one observation; the horizontal line marks zero, the mean of the residuals.

  3. The "Partial regression plot" shows the relationship between total_cases and total_recovered after taking into account the impact of the other independent variables on our total_recovered coefficient.

  4. The Component-Component plus Residual (CCPR) plot is an extension of the partial regression plot. It shows where our trend line would lie after adding the impact of the other independent variables on our existing total_recovered coefficient. This is the "component" part of the plot and is intended to show where the "fitted line" would lie.

Analysis of Covid_19 Total_cases, Active_Cases and Serious Cases in All countries.

In [219]:
fig = px.scatter(full_data, y='total_cases', x='active_cases',animation_frame="active_cases",text = "country",range_x =[0,100000],range_y=[0,100000])
fig.update_layout(
   # height=800,width=1000,
    title_text='Total Active_Cases in All countries',xaxis = dict(
        tickangle = 90,
        title_text = "Active_Cases",
        title_font = {"size": 15},
        title_standoff = 10),
    yaxis = dict(
        title_text = "Total_Cases",
        title_standoff = 10)
)
fig.show()
In [221]:
fig2 = px.scatter(full_data, y='total_cases', x='serious', animation_frame="serious",text ='country',range_x =[0,10000],range_y=[0,100000]) 
fig2.update_layout(
    #height=800,width=1000,
    title_text='Total Serious Cases in the world',xaxis = dict(
        tickangle = 90,
        title_text = "Serious_cases",
        title_font = {"size": 15},
        title_standoff = 10),
    yaxis = dict(
        title_text = "Total_Cases",
        title_standoff = 10)
)
fig2.show()

Analysis of Covid_19 Tot cases/1M pop, Deaths/1M pop and Tests/1M pop in All countries.

In [213]:
fig = px.scatter(full_data, x='tot\xa0cases/1m_pop', y='deaths/1m_pop', color='country',log_x=True ,log_y=True ,size_max=45)
fig.update_layout(
   # height=800,width=1000,
    title_text='Total Deaths Cases/ 1m_pop in the world',xaxis = dict(
        tickangle = 90,
        title_text = "Total_cases/1m_pop",
        title_font = {"size": 15},
        title_standoff = 10),
    yaxis = dict(
        title_text = "Total_Deaths Cases/1m_pop",
        title_standoff = 10)
)
fig.show()
In [214]:
fig2 = px.scatter(full_data, x='tot\xa0cases/1m_pop', y='tests/_1m_pop', color='country',log_x=True ,log_y=True ,size_max=45) 
fig2.update_layout(
    #height=800,width=1000,
    title_text='Total Test/1m_pop  in the world',xaxis = dict(
        tickangle = 90,
        title_text = "Total_cases/1m_pop",
        title_font = {"size": 15},
        title_standoff = 10),
    yaxis = dict(
        title_text = "Total_Test/1m_pop",
        title_standoff = 10)
)
fig2.show()

World Maps of Total_covid 19 Cases in Countries:

  • Total cases maps using Plotly and Mapbox, with animation to show how Covid-19 has spread across countries.
In [215]:
full_data.fillna(0, inplace=True)
In [ ]:
full_data.head(10)
In [216]:
fig = px.scatter_mapbox(full_data, lat="lat", lon="long",color = 'country', hover_name="country", hover_data=['total_cases', "total_deaths"],
                        color_continuous_scale=px.colors.cyclical.IceFire,
                        animation_frame='total_cases',size_max=55, zoom=3)
fig.update_layout(title_text="Total_Cases in World")
fig.update_layout(mapbox_style="open-street-map")
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

World Maps of Total_Deaths Cases in All Countries:

  • Total deaths maps using Plotly and Mapbox.
In [217]:
fig = px.scatter_mapbox(full_data, lat="lat", lon="long",color = 'country', hover_name="country", hover_data=['total_cases',"total_deaths" ,"total_recovered"],
                        color_continuous_scale=px.colors.cyclical.IceFire,
                        animation_frame='total_deaths',zoom=3)
fig.update_layout(title_text="Total_Deaths in World")
fig.update_layout(
 mapbox_style="white-bg",
    mapbox_layers=[
        {
            "below": 'traces',
            "sourcetype": "raster",
            "source": [
                "https://basemap.nationalmap.gov/arcgis/rest/services/USGSImageryOnly/MapServer/tile/{z}/{y}/{x}"
            ]
        }
      ])
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
In [222]:
full_data = full_data.sort_values(by=['tests/_1m_pop'])
full_data
Out[222]:
country total_cases new_cases total_deaths new_deaths total_recovered active_cases serious tot cases/1m_pop deaths/1m_pop total_tests tests/_1m_pop lat long
107 Diamond Princess 712 0 13 0 645 54 4 0.0 0 0 0 35.414583 139.682033
214 China 82887 1 4633 0 78046 208 15 58.0 3 0 0 35.861660 104.195397
70 Cameroon 2274 7 108 0 1232 934 12 86.0 4 0 0 7.369722 12.354722
74 Guinea 2009 0 11 0 663 1335 0 153.0 0 0 0 9.945587 -9.696645
212 Anguilla 3 0 0 0 3 0 0 200.0 0 0 0 18.220554 -63.068615
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
57 Bahrain 4774 330 8 0 2055 2711 2 2806.0 5 178353 104816 25.930414 50.637772
196 Falkland Islands 13 0 0 0 13 0 0 3736.0 0 402 115517 51.796300 59.523600
28 UAE 17417 624 185 11 4295 12937 1 1761.0 19 1200000 121330 23.424100 53.847800
76 Iceland 1801 0 10 0 1773 18 0 5278.0 29 53260 156076 64.963051 -19.020835
142 Faeroe Islands 187 0 0 0 187 0 0 3827.0 0 8403 171971 61.892600 6.911800

214 rows × 14 columns

In [223]:
full_data=pd.melt(full_data, id_vars=['country','tests/_1m_pop'], value_vars=['total_cases', 'new_cases', 'total_deaths', 'new_deaths', 'total_recovered'])
# plotly 
fig = px.line(full_data, x='country', y='value', color='variable',log_x=False ,log_y=True)
fig.update_layout(
   # height=800,width=1000,
    title_text='Total Cases in the world',xaxis = dict(
        tickangle = 90,
        title_text = "Country",
        title_font = {"size": 15},
        title_standoff = 10),
    yaxis = dict(
        title_text = "Value",
        title_standoff = 10)
)
# Show plot 
fig.show()

Exploratory data analysis (EDA) INDIA:

Covid_19 Cases in India:

This section examines the mortality and cured cases of Covid-19 in India. The analysis is based on the number of deaths and recovered patients in each State and Union Territory.

Univariate analysis:

In [224]:
base_color = sns.color_palette()[1]
plt.figure(figsize=(32,6))
g = sns.countplot(data = covid_india, x ='State/UnionTerritory', color = base_color)
g.set_xticklabels(g.get_xticklabels(), rotation=45)
g.set_title('Covid_19 Analysis Based on State/UnionTerritory')
n_points = covid_india.shape[0]
cat_counts = covid_india['State/UnionTerritory'].value_counts()
locs, labels = plt.xticks() 
for loc, label in zip(locs, labels):
    count = cat_counts[label.get_text()]
    pct_string = '{:0.1f}%'.format(100*count/n_points)
    plt.text(loc, count-8, pct_string, ha = 'left',va='bottom', color = 'black')

BiVariate Analysis of Covid_19 Cases in India:

In [225]:
fig = px.scatter(covid_india,x="Confirmed",y="Deaths" ,animation_frame="Deaths", animation_group="State/UnionTerritory",color="State/UnionTerritory",log_x=True ,log_y=True , range_x=[1,10000],range_y=[1,10000])
fig.update_traces(textposition='top center')

fig.update_layout(
    #height=800,width=1000,
    title_text='Total Deaths Cases in India',xaxis = dict(
        tickangle = 90,
        title_text = "Total_cases",
        title_font = {"size": 15},
        title_standoff = 10),
    yaxis = dict(
        title_text = "Total_Deaths Cases",
        title_standoff = 10),
    
)

fig.show()
In [226]:
fig = px.scatter(covid_india,x="Confirmed",y="Cured", animation_frame="Cured", animation_group="State/UnionTerritory",color="State/UnionTerritory", log_x=True ,log_y=True ,size_max=45, range_x=[1,10000],range_y=[1,10000])
fig.update_traces(textposition='top center')

fig.update_layout(
    #height=800,width=1000,
    title_text='Total Recovered Cases in India',xaxis = dict(
        tickangle = 90,
        title_text = "Total_cases",
        title_font = {"size": 15},
        title_standoff = 10),
    yaxis = dict(
        title_text = "Total Recovered Cases",
        title_standoff = 10)
)


fig.show()

Analysis of Covid_19 Cases in India Based on StatsModel:

- The Statsmodels module is used to analyse the Covid-19 cases in India. It provides classes and functions for estimating regression models, conducting statistical tests and exploring data statistically. Here it is applied to 'Confirmed', 'Deaths' and 'Cured' cases in India.

In [227]:
X = covid_india[['Deaths']] 
# Fit an OLS model with an intercept: Confirmed on Deaths.
y = covid_india['Confirmed']
X = sm.add_constant(X)
est = sm.OLS(y, X).fit()
print(est.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:              Confirmed   R-squared:                       0.890
Model:                            OLS   Adj. R-squared:                  0.890
Method:                 Least Squares   F-statistic:                 1.354e+04
Date:                Sun, 10 May 2020   Prob (F-statistic):               0.00
Time:                        02:20:31   Log-Likelihood:                -12356.
No. Observations:                1671   AIC:                         2.472e+04
Df Residuals:                    1669   BIC:                         2.473e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        112.9189      9.963     11.334      0.000      93.377     132.461
Deaths        22.6790      0.195    116.342      0.000      22.297      23.061
==============================================================================
Omnibus:                     1165.847   Durbin-Watson:                   1.700
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            30900.445
Skew:                           2.903   Prob(JB):                         0.00
Kurtosis:                      23.251   Cond. No.                         52.9
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Regression Analysis:

*β₀ and β₁ are chosen to minimize the sum of squared distances between the predicted values and the actual values.

Adj. R-squared indicates that 89% of the variation in Confirmed cases is explained by our predictor variable, Deaths.

To understand the trend, we look at the slope of Confirmed cases against Deaths on a linear scale.

From our results, we see that:

• The intercept 𝛽̂0 = 112.92 (about 113) is the predicted number of Confirmed cases when Deaths is zero.

• The slope 𝛽̂1 = 22.68. The regression coefficient (coef) represents the change in the dependent variable associated with a one-unit change in the predictor variable, all other variables being held constant. In our model, each additional death is associated with about 22.68 more Confirmed cases. In line with our assumptions, Confirmed cases rise with Deaths; this is an association, not evidence that deaths cause cases.

• The p-value for 𝛽̂1 is 0.000 (as reported to three decimals). It is the probability of observing a slope at least as extreme as 22.68 if there were no relationship between the two variables. Because it is below 0.05 (our rejection rule), the slope is statistically significant.

** The standard error measures the precision of the Deaths coefficient by estimating how much the coefficient would vary if the same model were fitted on a different sample. Our standard error, 0.195, is small relative to the coefficient, so the estimate appears precise.

• The R-squared value is 0.89.
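
The quantities in the regression summary can also be computed by hand. The following is a minimal sketch on synthetic data (the arrays x and y are made up for illustration and are not the covid_india columns); it uses the closed-form least-squares formulas for a single predictor rather than statsmodels itself.

```python
import numpy as np

# Synthetic data, purely illustrative (not the notebook's covid_india frame).
rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=200)
y = 100.0 + 20.0 * x + rng.normal(0, 50, size=200)

# Closed-form least-squares estimates for y = b0 + b1 * x.
x_mean, y_mean = x.mean(), y.mean()
sxx = np.sum((x - x_mean) ** 2)
b1 = np.sum((x - x_mean) * (y - y_mean)) / sxx
b0 = y_mean - b1 * x_mean

# R-squared: share of the variance in y explained by the fitted line.
resid = y - (b0 + b1 * x)
ss_res = np.sum(resid ** 2)
ss_tot = np.sum((y - y_mean) ** 2)
r_squared = 1.0 - ss_res / ss_tot

# Standard error of the slope and the corresponding t statistic.
n = len(x)
sigma2 = ss_res / (n - 2)          # residual variance, n - 2 degrees of freedom
se_b1 = np.sqrt(sigma2 / sxx)
t_b1 = b1 / se_b1

print(f"b0={b0:.2f}  b1={b1:.2f}  R2={r_squared:.3f}  se(b1)={se_b1:.3f}  t={t_b1:.1f}")
```

The same names map onto the summary table: b0 and b1 are the "coef" column, se_b1 is "std err", and t_b1 is the "t" column for the predictor.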

In [228]:
fig = plt.figure(figsize =(15,8))
#full_data1= sm.dataset.full_data.load_pandas()
results = smf.ols('Confirmed ~Deaths ', data = covid_india).fit()
#sm.graphics.plot_ccpr_grid(results, fig=fig)
sm.graphics.plot_regress_exog(results, 'Deaths', fig=fig)
plt.show()

Regression Plot Analysis:

  1. The “Y and Fitted vs. X” graph plots the dependent variable and our predicted values against the predictor, with a confidence interval. The upward-sloping relationship in our graph indicates that Confirmed_cases and deaths are positively correlated, i.e., when one variable increases the other increases too.

  2. The “Residuals versus Deaths” graph shows our model's errors against the specified predictor variable. Each dot is the residual of one observation; the horizontal line marks zero, around which the residuals should scatter evenly if the model fits well.

  3. The “Partial regression plot” shows the relationship between Confirmed_cases and deaths after removing the effect of any other independent variables. Our model has a single predictor, so no other variables are adjusted for here.

  4. The Component and Component Plus Residual (CCPR) plot is an extension of the partial regression plot. It plots the fitted contribution of deaths (the "component", i.e. the coefficient times the predictor) together with that component plus the residuals, and is intended to show where the "fitted line" would lie after accounting for any other independent variables.
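
To make the residual and CCPR panels more concrete, here is a small sketch of the values they are built from. It assumes a single-predictor model fitted on synthetic data with numpy.polyfit; the variable names are illustrative and do not come from the notebook.

```python
import numpy as np

# Synthetic single-predictor data (illustrative only).
rng = np.random.default_rng(1)
x = rng.uniform(0, 50, size=100)
y = 10.0 + 4.0 * x + rng.normal(0, 5, size=100)

# Least-squares line; polyfit returns the slope first, then the intercept.
b1, b0 = np.polyfit(x, y, 1)

fitted = b0 + b1 * x        # values drawn in the "Y and Fitted vs. X" panel
resid = y - fitted          # values drawn in the residuals panel
component = b1 * x          # the "component" line of the CCPR panel
ccpr = component + resid    # the "component plus residual" points

print("mean residual:", resid.mean())
print("corr(x, residual):", np.corrcoef(x, resid)[0, 1])
```

With an intercept in the model, the residuals average to zero and are uncorrelated with the predictor, which is why the residual panel is read against a horizontal line at zero.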

In [229]:
# Fit an OLS model with an intercept: Confirmed regressed on Cured
X = covid_india[['Cured']]
y = covid_india['Confirmed']
X = sm.add_constant(X)  # add the intercept column
est = sm.OLS(y, X).fit()
print(est.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:              Confirmed   R-squared:                       0.793
Model:                            OLS   Adj. R-squared:                  0.793
Method:                 Least Squares   F-statistic:                     6388.
Date:                Sun, 10 May 2020   Prob (F-statistic):               0.00
Time:                        02:20:33   Log-Likelihood:                -12886.
No. Observations:                1671   AIC:                         2.578e+04
Df Residuals:                    1669   BIC:                         2.579e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         42.6187     14.004      3.043      0.002      15.151      70.086
Cured          4.2145      0.053     79.925      0.000       4.111       4.318
==============================================================================
Omnibus:                      645.833   Durbin-Watson:                   1.660
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            25483.266
Skew:                           1.103   Prob(JB):                         0.00
Kurtosis:                      22.004   Cond. No.                         281.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Regression Analysis:

*β₀ and β₁ are chosen to minimize the sum of squared distances between the predicted values and the actual values.

Adj. R-squared indicates that 79% of the variation in Confirmed cases is explained by our predictor variable, Cured.

To understand the trend, we look at the slope of Confirmed cases against Cured cases on a linear scale.

From our results, we see that:

• The intercept 𝛽̂0 = 42.62 is the predicted number of Confirmed cases when Cured is zero.

• The slope 𝛽̂1 = 4.21. The regression coefficient (coef) represents the change in the dependent variable associated with a one-unit change in the predictor variable, all other variables being held constant. In our model, each additional cured case is associated with about 4.21 more Confirmed cases. In line with our assumptions, Confirmed cases rise with Cured cases; this is an association, not evidence that recoveries cause cases.

• The p-value for 𝛽̂1 is 0.000 (as reported to three decimals). It is the probability of observing a slope at least as extreme as 4.21 if there were no relationship between the two variables. Because it is below 0.05 (our rejection rule), the slope is statistically significant.

** The standard error measures the precision of the Cured coefficient by estimating how much the coefficient would vary if the same model were fitted on a different sample. Our standard error, 0.053, is small relative to the coefficient, so the estimate appears precise.

• The R-squared value is 0.79.
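
As a cross-check on the two summary tables, the t statistic and the 95% confidence interval can be recomputed from the printed coefficient and standard error. This is a rough sketch: the numbers are copied from the tables above, and the multiplier 1.96 is a normal approximation that I assume is close enough for 1669 residual degrees of freedom.

```python
# Coefficients and standard errors as printed in the two OLS summaries.
reported = {
    "Deaths": {"coef": 22.6790, "std_err": 0.195},
    "Cured": {"coef": 4.2145, "std_err": 0.053},
}

Z_95 = 1.96  # normal approximation to the t critical value (assumption)

recomputed = {}
for name, row in reported.items():
    t_stat = row["coef"] / row["std_err"]
    lower = row["coef"] - Z_95 * row["std_err"]
    upper = row["coef"] + Z_95 * row["std_err"]
    recomputed[name] = (t_stat, lower, upper)
    print(f"{name}: t = {t_stat:.2f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

The recomputed intervals should agree with the [0.025, 0.975] columns up to rounding; small differences in t come from the standard errors being printed to only three decimals.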

In [230]:
fig = plt.figure(figsize =(15,8))
#full_data1= sm.dataset.full_data.load_pandas()
results = smf.ols('Confirmed ~Cured', data = covid_india).fit()
#sm.graphics.plot_ccpr_grid(results, fig=fig)
sm.graphics.plot_regress_exog(results, 'Cured', fig=fig)
plt.show()

Regression Plot Analysis:

  1. The “Y and Fitted vs. X” graph plots the dependent variable and our predicted values against the predictor, with a confidence interval. The upward-sloping relationship in our graph indicates that Cured and Confirmed are positively correlated, i.e., when one variable increases the other increases too.

  2. The “Residuals versus Cured” graph shows our model's errors against the specified predictor variable. Each dot is the residual of one observation; the horizontal line marks zero, around which the residuals should scatter evenly if the model fits well.

  3. The “Partial regression plot” shows the relationship between Cured and Confirmed cases after removing the effect of any other independent variables. Our model has a single predictor, so no other variables are adjusted for here.

  4. The Component and Component Plus Residual (CCPR) plot is an extension of the partial regression plot. It plots the fitted contribution of Cured (the "component", i.e. the coefficient times the predictor) together with that component plus the residuals, and is intended to show where the "fitted line" would lie after accounting for any other independent variables.

Analysis of Covid_19 Cases in India and its State /Union Territory :

In [231]:
f,g = plt.subplots(figsize = (15,10))
base_color = sns.color_palette()[0]
# Overlay three horizontal bar plots: confirmed, cured and deaths per state
g = sns.barplot(data = covid_india, x = 'Confirmed', y = 'State/UnionTerritory',
            label = 'Total Confirmed Cases',color = base_color)
sns.set_color_codes('muted')
# Lowercase 'r': uppercase single-letter colors are deprecated in Matplotlib
g = sns.barplot(x = 'Cured', y = 'State/UnionTerritory', data = covid_india,
            label = 'Total number of Cured', color = 'r', edgecolor = 'w')
sns.set_color_codes('pastel')
g= sns.barplot(x = 'Deaths', y = 'State/UnionTerritory', data = covid_india,
            label = 'Total number of Deaths', color = 'g', edgecolor = 'w')
g.set_title('Analysis of Covid_19 Cases in India')
g.legend(ncol = 3, loc = 'lower right')
#sns.despine(left = True, bottom = True)
plt.show()

Analysis of Covid_19 Cases in India and its State /Union Territory

In [232]:
# Create one figure with two side-by-side axes; passing figsize here avoids
# leaving behind a separate empty figure.
fig,ax = plt.subplots(1, 2, figsize=(20,15))
g =sns.scatterplot(data=covid_india,y="Deaths",x="Confirmed",ax=ax[0],hue="State/UnionTerritory",palette="deep")
g.legend(loc='upper right', bbox_to_anchor=(.20, 0.0), ncol=1)
g =sns.scatterplot(data=covid_india,y="Cured",x="Confirmed",ax=ax[1],hue="State/UnionTerritory",palette="deep")
g.legend(loc='upper left', bbox_to_anchor=(1.0, 0.0), ncol=1)
# The title describes both panels, so attach it to the figure
fig.suptitle('Covid_19 Deaths Cases and Recovered Cases in India and its State/ Union Territory')
plt.show()

Analysis Based on Age Group in India

In [233]:
fig = px.bar(covid_india_age, y='TotalCases', x='AgeGroup', text='Percentage')
fig.update_traces(texttemplate='%{text}', textposition='outside',marker_color='lightsalmon')
fig.update_layout(uniformtext_minsize=8, uniformtext_mode='hide')
fig.update_layout(title_text="Analysis Based on Age Group in India")

fig.show()
In [234]:
s = covid_india_state_testing.sum()
s
Out[234]:
Date            2020-04-172020-04-242020-04-272020-05-012020-0...
State           Andaman and Nicobar IslandsAndaman and Nicobar...
TotalSamples                                          1.57851e+07
Negative                                              1.30916e+07
Positive                                                   571595
dtype: object
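
Calling sum() on the whole frame also "sums" the Date and State columns by concatenating their strings, which is why the first two rows of the output above are unreadable. A possible alternative is sketched below on a small made-up frame (the frame testing and its values are hypothetical). It restricts the sum to the numeric columns and, under the assumption that the counts are cumulative per state and date, keeps only the latest row per state before adding them up.

```python
import pandas as pd

# Hypothetical stand-in for covid_india_state_testing (values are made up).
testing = pd.DataFrame({
    "Date": ["2020-04-17", "2020-04-24", "2020-04-17", "2020-04-24"],
    "State": ["A", "A", "B", "B"],
    "TotalSamples": [100.0, 250.0, 50.0, 80.0],
    "Negative": [90.0, 220.0, 45.0, 70.0],
    "Positive": [10.0, 30.0, 5.0, 10.0],
})

numeric_cols = ["TotalSamples", "Negative", "Positive"]

# Sum over every row: readable, but counts each state once per report date.
naive_totals = testing[numeric_cols].sum()

# If the counts are cumulative (assumption), use the latest row per state.
latest_per_state = testing.sort_values("Date").groupby("State")[numeric_cols].last()
latest_totals = latest_per_state.sum()

print(naive_totals)
print(latest_totals)
```

Whether the second variant is the right one depends on how the testing file records its counts, which should be checked against the data source before quoting totals.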

Analysis of Covid_19 Cases Based on Testing (Positive Cases and Negative Case) in India :

In [235]:
fig = px.scatter(covid_india_state_testing, x='Date', y='Positive', title='Positive Cases Time Series with Rangeslider', range_x=['2020-01-23','2020-05-04'],range_y = [1,100000])
fig.update_traces(marker_color='indianred')
fig.update_xaxes(rangeslider_visible=True)
fig.update_layout(
   # height=800,width=1000,
    title_text='Covid_19 Testing in India',xaxis = dict(
        tickangle = 90,
        title_text = "Date",
        title_font = {"size": 15},
        title_standoff = 10),
    yaxis = dict(
        title_text = "Positive",
        title_standoff = 10),
    
)

fig.show()
In [236]:
fig1 = px.scatter(covid_india_state_testing, x='Date', y='Negative', title='Negative Cases Time Series with Rangeslider',range_x = ['2020-01-23','2020-05-04'],range_y = [1,1000000])
fig1.update_traces(marker_color='lightsalmon')
fig1.update_xaxes(rangeslider_visible=True)
fig1.update_layout(
   # height=800,width=1000,
    title_text='Covid_19 Testing in India',xaxis = dict(
        tickangle = 90,
        title_text = "Date",
        title_font = {"size": 15},
        title_standoff = 10),
    yaxis = dict(
        title_text = "Negative",
        title_standoff = 10),
    
)

fig1.show()

Covid_Time_Series Analysis: Datewise Confirmed Cases, Deaths and Recovered Cases in All Countries:

Date: 22 Jan 2020 to 6 May 2020

  1. The total number of confirmed cases to date for each country and province.
  2. The percentage of confirmed cases that resulted in deaths and in recoveries.
  3. The maximum number of confirmed cases across all countries.
  4. The maximum number of deaths across all countries.
  5. The maximum number of recovered cases across all countries.
In [237]:
covid_time_series.head(10)
Out[237]:
   Country/Region     Date  Confirmed  Deaths  Recovered  Death_percentage  Recovered_percentage  Day
0     Afghanistan  1/23/20          0       0          0               0.0                   0.0    0
1     Afghanistan  1/24/20          0       0          0               0.0                   0.0    0
2     Afghanistan  1/25/20          0       0          0               0.0                   0.0    0
3     Afghanistan  1/26/20          0       0          0               0.0                   0.0    0
4     Afghanistan  1/27/20          0       0          0               0.0                   0.0    0
5     Afghanistan  1/28/20          0       0          0               0.0                   0.0    0
6     Afghanistan  1/29/20          0       0          0               0.0                   0.0    0
7     Afghanistan  1/30/20          0       0          0               0.0                   0.0    0
8     Afghanistan  1/31/20          0       0          0               0.0                   0.0    0
9     Afghanistan   2/1/20          0       0          0               0.0                   0.0    0
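
The Death_percentage and Recovered_percentage columns are shown here already computed. One plausible way to derive such columns is sketched below on a made-up frame; it assumes the percentages are taken relative to Confirmed, and the handling of zero confirmed cases (reporting 0.0) is my own choice, not necessarily what the notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the same column names as covid_time_series.
ts = pd.DataFrame({
    "Country/Region": ["X"] * 4,
    "Date": ["1/23/20", "1/24/20", "1/25/20", "1/26/20"],
    "Confirmed": [0, 4, 10, 20],
    "Deaths": [0, 0, 1, 2],
    "Recovered": [0, 1, 2, 5],
})

# Replace zero confirmed counts with NaN so the division never divides by zero.
confirmed = ts["Confirmed"].replace(0, np.nan)

ts["Death_percentage"] = (ts["Deaths"] / confirmed * 100).fillna(0.0)
ts["Recovered_percentage"] = (ts["Recovered"] / confirmed * 100).fillna(0.0)

print(ts)
```

Rows with no confirmed cases then show 0.0 in both percentage columns, matching the shape of the output above.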
In [238]:
fig = px.scatter(covid_time_series_C, x='Date', y='Confirmed',color="Country/Region", title='Confirmed Cases Time Series with Rangeslider',range_x=['1/22/20','4/29/20'],range_y = [1,1500000])
#fig.update_traces(marker_color='darkslateblue')
fig.update_layout(
   # height=800,width=1000,
    title_text= 'Datewise Confirmed Cases in all Countries',xaxis = dict(
        tickangle = 90,
        title_text = "Date",
        title_font = {"size": 15},
        title_standoff = 10),
    yaxis = dict(
        title_text = "Confirmed Cases",
        title_standoff = 10),
    
)

fig.update_xaxes(rangeslider_visible=True)
fig.show()
In [239]:
fig = px.scatter(covid_time_series_D, x='Date', y='Deaths',color="Country/Region", title='Deaths Cases Time Series with Rangeslider',range_x=['1/22/20','4/29/20'],range_y = [1,100000])
#fig.update_traces(marker_color='sandybrown')
fig.update_layout(
   # height=800,width=1000,
    title_text= 'Datewise Deaths Cases in all Countries',xaxis = dict(
        tickangle = 90,
        title_text = "Date",
        title_font = {"size": 15},
        title_standoff = 10),
    yaxis = dict(
        title_text = " Deaths Cases",
        title_standoff = 10),
    
)

fig.update_xaxes(rangeslider_visible=True)

fig.show()
In [240]:
fig = px.scatter(covid_time_series_covid_19_R, x='Date', y='Recovered',color="Country/Region", title='Recovered Cases Time Series with Rangeslider',range_x=['1/22/20','4/29/20'],range_y = [1,1000000])
#fig.update_traces(marker_color='green')
fig.update_layout(
    #height=800,width=1000,
    title_text= 'Datewise Recovered Cases in all Countries',xaxis = dict(
        tickangle = 90,
        title_text = "Date",
        title_font = {"size": 15},
        title_standoff = 10),
    yaxis = dict(
        title_text = "Recovered Cases",
        title_standoff = 10),
    
)

fig.update_xaxes(rangeslider_visible=True)
fig.show()

Covid_Time_Series Analysis: Datewise Confirmed Cases,Deaths, Recovered Cases in India :

Date: 22 Jan 2020 to 6 May 2020

In [241]:
fig1 = px.scatter(covid_time_series_I, x='Date', y='Confirmed',color="Country/Region", title='Confirmed Cases Time Series with Rangeslider',range_x=['1/22/20','4/29/20'],range_y = [1,10000])
fig1.update_layout(
   # height=800,width=1000,
    title_text= 'Datewise Confirmed Cases in India',xaxis = dict(
        tickangle = 90,
        title_text = "Date",
        title_font = {"size": 15},
        title_standoff = 10),
    yaxis = dict(
        title_text = "Confirmed Cases",
        title_standoff = 10),
    
)


fig1.update_xaxes(rangeslider_visible=True)
fig1.show()
In [242]:
fig2 = px.scatter(covid_time_series_I, x='Date', y='Deaths',color="Country/Region",range_x=['1/22/20','4/29/20'],range_y = [1,10000])
fig2.update_layout(
   #height=800,width=1000,
    title_text= 'Datewise Deaths Cases in India ',xaxis = dict(
        tickangle = 90,
        title_text = "Date",
        title_font = {"size": 15},
        title_standoff = 10),
    yaxis = dict(
        title_text = "Deaths Cases",
        title_standoff = 10),
    
)

fig2.update_xaxes(rangeslider_visible=True)
fig2.show()
In [243]:
fig3 = px.scatter(covid_time_series_I, x='Date', y='Recovered',color="Country/Region", title='Recovered Cases Time Series with Rangeslider',range_x=['1/22/20','4/29/20'],range_y = [1,10000])
fig3.update_layout(
    #height=800,width=1000,
    title_text= 'Datewise Recovered Cases in India ',xaxis = dict(
        tickangle = 90,
        title_text = "Date",
        title_font = {"size": 15},
        title_standoff = 10),
    yaxis = dict(
        title_text = "Recovered Cases",
        title_standoff = 10),
    
)


fig3.update_xaxes(rangeslider_visible=True)
fig3.show()

Daywise Covid_19 Analysis:

The plots below show, for each country, the number of confirmed cases, deaths and recovered cases by day, which makes it possible to see which countries had the highest counts on a particular day.
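
The day on which each country saw its largest increase can also be looked up directly instead of being read off a plot. The sketch below uses a small made-up frame and assumes that Confirmed is a cumulative count per country; the New_confirmed column is a helper introduced only for this illustration.

```python
import pandas as pd

# Hypothetical frame using the same column names as covid_time_series.
ts = pd.DataFrame({
    "Country/Region": ["X", "X", "X", "Y", "Y", "Y"],
    "Day": [1, 2, 3, 1, 2, 3],
    "Confirmed": [5, 12, 30, 2, 9, 11],
})

ts = ts.sort_values(["Country/Region", "Day"])

# Daily new cases: difference of the cumulative count within each country.
# The first day of each country has no previous value, so keep its own count.
ts["New_confirmed"] = (
    ts.groupby("Country/Region")["Confirmed"].diff().fillna(ts["Confirmed"])
)

# Row with the largest single-day increase for each country.
peak_rows = ts.loc[ts.groupby("Country/Region")["New_confirmed"].idxmax()]

print(peak_rows[["Country/Region", "Day", "New_confirmed"]])
```

The same pattern would apply to the Deaths and Recovered columns by swapping the column name.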

In [244]:
fig = px.scatter(covid_time_series, x='Day', y='Confirmed',color="Country/Region",log_x=False ,log_y=True ) 
fig.update_layout(
    #height=800,width=1000,
    title_text= 'Daywise Confirmed Cases in All Countries ',xaxis = dict(
        tickangle = 90,
        title_text = "Day",
        title_font = {"size": 15},
        title_standoff = 10),
    yaxis = dict(
        title_text = "Confirmed Cases",
        title_standoff = 10),
    
)

fig.show()
In [245]:
fig1 = px.scatter(covid_time_series, x='Day', y='Deaths',color="Country/Region",log_x=False ,log_y=True )
fig1.update_layout(
    #height=800,width=1000,
    title_text= 'Daywise Deaths Cases in All Countries ',xaxis = dict(
        tickangle = 90,
        title_text = "Day",
        title_font = {"size": 15},
        title_standoff = 10),
    yaxis = dict(
        title_text = "Deaths Cases",
        title_standoff = 10),
    
)


fig1.show()
In [246]:
fig2 = px.scatter(covid_time_series, x='Day', y='Recovered',color="Country/Region",log_x=False ,log_y=True )
fig2.update_layout(
    #height=800,width=1000,
    title_text= 'Daywise Recovered Cases in All Countries ',xaxis = dict(
        tickangle = 90,
        title_text = "Day",
        title_font = {"size": 15},
        title_standoff = 10),
    yaxis = dict(
        title_text = "Recovered Cases",
        title_standoff = 10),
    
)


fig2.show()

Conclusion:

 - The first cases of covid_19 were reported in Wuhan (the city where the virus originated), the largest city in Central China, with a population of over 11 million people. The city shut down its transport links on January 23.
 - 2-14 days represents the current official estimated range for the incubation period of the novel coronavirus COVID-19.
 - In January, relatively few novel coronavirus cases were reported in the UK, Russia, Sweden and Spain.
 - In March and April there was a tremendous increase in Confirmed cases and Deaths all over the world.
   ** Total Confirmed Cases in World = 3,913,644
   ** Total Deaths in World = 270,426
   ** Total Recovered Cases in World = 1,341,022
   ** Total Confirmed Cases in India = 56,342
   ** Total Deaths in India = 1,886
   ** Total Recovered Cases in India = 16,540
   (* figures current at the time of writing)
  • State-wise testing data for India sum to TotalSamples 1.57851e+07, Negative 1.30916e+07 and Positive 571595.
  • Time series data show that confirmed cases increased rapidly, and that deaths also increased across countries.

  • Time series data also show that recovered cases outnumber deaths across countries.

  • "Confirmed" refers to a case being reported, as opposed to a person being infected. Day-to-day fluctuations in confirmed cases are therefore partly due to the sampling bias induced by the limited supply of coronavirus test kits.

    -As a result, many cases that do not yet show symptoms may go untested, and the mortality rate is expressed in deaths per million inhabitants.
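
As a quick arithmetic check on the world totals quoted in this conclusion, the share of deaths and recoveries among confirmed cases can be computed directly. This is only a sketch based on the three world figures above; the "active" count assumes every confirmed case is either active, recovered or dead.

```python
# World totals as quoted in the conclusion.
confirmed = 3_913_644
deaths = 270_426
recovered = 1_341_022

# Share of confirmed cases that ended in death or recovery so far.
death_pct = deaths / confirmed * 100
recovered_pct = recovered / confirmed * 100

# Cases neither recovered nor dead (assumes no other outcome category).
active = confirmed - deaths - recovered

print(f"Deaths:    {death_pct:.2f}% of confirmed cases")
print(f"Recovered: {recovered_pct:.2f}% of confirmed cases")
print(f"Active:    {active:,} cases")
```

Because confirmed cases depend on how much testing is done, these percentages describe reported cases only and should not be read as the true fatality or recovery rate of the disease.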

In [247]:
def hide_code_in_slideshow():   
    from IPython import display
    import binascii
    import os
    uid = binascii.hexlify(os.urandom(8)).decode()    
    html = """<div id="%s"></div>
    <script type="text/javascript">
        $(function(){
            var p = $("#%s");
            if (p.length==0) return;
            while (!p.hasClass("cell")) {
                p=p.parent();
                if (p.prop("tagName") =="body") return;
            }
            var cell = p;
            cell.find(".input").addClass("hide-in-slideshow")
        });
    </script>""" % (uid, uid)
    display.display_html(html, raw=True)
In [248]:
hide_code_in_slideshow()
In [249]:
#jupyter nbconvert presentation.ipynb --to slides --template output-toggle.tpl
#jupyter nbconvert Jupyter\ "covid analysis-final.ipynb" --to slides --template output-toggle.tpl --post serve